Re: [R] missing values in logistic regression
(Ted Harding) wrote:
> On 29-Oct-04 Avril Coghlan wrote:
>> Dear R help list, I am trying to do a logistic regression where I
>> have a categorical response variable Y and two numerical predictors
>> X1 and X2. There are quite a lot of missing values for predictor X2,
>> e.g.,
>>
>>   Y      X1    X2
>>   red    0.6   0.2  *
>>   red    0.5   0.2  *
>>   red    0.5   NA
>>   red    0.5   NA
>>   green  0.2   0.1  *
>>   green  0.1   NA
>>   green  0.1   NA
>>   green  0.05  0.05 *
>>
>> I am wondering: can I combine X1 and X2 in a logistic regression to
>> predict Y, using all the data for X1, even though there are NAs in
>> the X2 data? Or do I have to take only the cases for which there is
>> data for both X1 and X2 (marked with *s above)?
>
> I don't know of any R routine directly aimed at logistic regression
> with missing values as you describe.

The aregImpute function in the Hmisc package can handle this, using predictive mean matching with weighted multinomial sampling of donor observations' binary covariate values.

> [...] Ted.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
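A minimal sketch of the aregImpute route mentioned above, assuming the variables live in a data frame d (a hypothetical name); fit.mult.impute then pools logistic fits over the imputations:

library(Hmisc)
library(Design)   # provides lrm

set.seed(1)
a <- aregImpute(~ Y + X1 + X2, data=d, n.impute=5)  # multiple imputation of X2
f <- fit.mult.impute(Y ~ X1 + X2, lrm, a, data=d)   # logistic fits combined
f                                                   # over the 5 imputations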
Re: [R] time dependency of Cox regression
array chip wrote:
> Hi, how can I specify a Cox proportional hazards model with a
> covariate whose effect on survival I believe changes/diminishes with
> time? The value of the covariate was only recorded once at the
> beginning of the study for each individual (e.g. at the diagnosis of
> the disease), so I do not have time-course data for the covariate for
> any given individual. For example, I want to state at the end of the
> analysis that the hazard ratio for the covariate is 6 at the
> beginning, decreases to 3 after 2 years, and decreases to 1.5 after 5
> years. Is this the so-called time-dependent covariate? I guess not,
> because it's really about the influence of the covariate (which was
> measured once at the beginning) on survival changing over time.
> Thanks for any input.

You might try a log-normal or log-logistic accelerated failure time model. These models dictate decreasing hazard ratios over time for baseline covariates.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
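A sketch of how a log-normal AFT fit implies a hazard ratio that attenuates toward 1 over time; the data frame d, follow-up time futime, event indicator status, and binary covariate x are hypothetical names:

library(survival)

f <- survreg(Surv(futime, status) ~ x, data=d, dist='lognormal')

# hazard of a log-normal distribution: h(t) = f(t)/S(t)
hlnorm <- function(t, mu, sigma) {
  z <- (log(t) - mu)/sigma
  dnorm(z) / (sigma * t * pnorm(z, lower.tail=FALSE))
}

mu0 <- predict(f, newdata=data.frame(x=0), type='lp')
mu1 <- predict(f, newdata=data.frame(x=1), type='lp')
t   <- c(0.5, 2, 5)                                 # years, say
hlnorm(t, mu1, f$scale) / hlnorm(t, mu0, f$scale)   # HR at each time; declines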
Re: [R] OLS error
DrakeGis wrote:
> Thanks for the help. I'm really sorry, but I'm afraid I can't publish
> any data in order to allow a reproduction of the results (enterprise
> policies :( ).

That is silly. You can surely simulate data that provides an example of the problem you are having.

Frank Harrell
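For example, a few lines suffice to post a self-contained imitation of one's data (all names and coefficients below are made up):

set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)   # same structure as the private data
fit <- lm(y ~ x1 + x2)
summary(fit)                     # include this plus the error message seen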
Re: [R] an off-topic question - model validation
Wensui Liu wrote:
> Currently, I am working on a data mining project and plan to divide
> the data table into 2 parts, one for modeling and the other for
> validation, to compare several models. But I am not sure about the
> percentage of data I should use to build the model and the percentage
> I should keep to validate it. Is there any literature reference about
> this topic? Thank you so much!

Data splitting is very inefficient for model validation unless the sample size is extremely large. Consider using Efron's optimism bootstrap as is used in the validate function in the Design package. validate will also do data splitting and cross-validation, though.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
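A minimal sketch of the bootstrap validation described above, assuming a binary response y and predictors x1, x2 in a data frame d (hypothetical names):

library(Design)   # also requires Hmisc

f <- lrm(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)  # x, y needed for resampling
validate(f, method='boot', B=200)              # Efron's optimism bootstrap
validate(f, method='crossvalidation', B=10)    # 10-fold CV alternative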
Re: [R] Barplot difficulties
Heather J. Branton wrote:
> Hello. I am an R newbie struggling to learn and use R. I have read
> many portions of the R Reference Manual, as well as the FAQs. Given
> that I learn something new each time, I know I might be missing
> something obvious, but I appeal to your good nature to help me through
> this initial problem. I have attached a pdf file to demonstrate what I
> desire and have listed what my data looks like in Excel (below).
> Following is the data and script I developed - which does not provide
> what I want.

Dear Heather - Please read Bill Cleveland's book The Elements of Graphing Data. A MUCH better plot can be produced. And let time be one of the first variables to vary.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
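In the spirit of that advice, a Cleveland-style dot plot is a one-liner in base R; the matrix below is an invented stand-in for the real data:

# hypothetical summary matrix: rows = categories, columns = years
m <- matrix(c(12, 15, 9, 20, 18, 11), nrow=3,
            dimnames=list(c('Site A','Site B','Site C'), c('1999','2000')))
dotchart(m, xlab='Measurement')  # one dot per value, grouped by column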
Re: [R] Adding mean and SEM in Hmisc ecdf
Jean-Louis Abitbol wrote:
> Dear R Gurus, sorry if this has been asked before, but I did not find
> it in the archives. I would like to add a horizontal display of the
> mean and SEM on Hmisc ecdf plots done by group (i.e., variate ~
> treatment). Has anyone written some code to do that? Thanks and kind
> regards, Jean-Louis

It is customary to show quantiles on ecdfs, and the functions make that easy. You can add other points or reference lines to an existing plot easily if using the non-lattice ecdf. To do it with the lattice version, ecdf.formula, you will have to change its default panel function.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
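A sketch of the non-lattice route for a single panel, assuming a numeric y and a grouping factor treatment (hypothetical names; the function is Ecdf in later Hmisc versions), with the mean +/- 1 SEM drawn by hand:

library(Hmisc)

Ecdf(y, group=treatment)                    # one step function per group
m  <- tapply(y, treatment, mean, na.rm=TRUE)
se <- tapply(y, treatment,
             function(u) sqrt(var(u, na.rm=TRUE)/sum(!is.na(u))))
abline(v=m, lty=2)                          # vertical lines at group means
segments(m - se, 0.5, m + se, 0.5)          # horizontal mean +/- 1 SEM bars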
Re: [R-gui] Re: [R] The hidden costs of GPL software?
Patrick Burns wrote:
> I'm a big advocate -- perhaps even fanatic -- of making R easier for
> novices in order to spread its use, but I'm not convinced that a GUI
> (at least in the traditional form) is the most valuable approach.
> Perhaps an overly harsh summary of some of Ted Harding's statements
> is: you can make a truck easier to get into by taking off the wheels,
> but that doesn't make it more useful. In terms of GUIs, I think what R
> should focus on is the ability for users to make their own specialized
> GUIs, so that a knowledgeable programmer at an installation can create
> a system that is easy for unsophisticated users for the limited number
> of tasks that are to be done. The ultimate users may not even need to
> know that R exists. I think Ted Harding was on the mark when he said
> that it is the help system that needs enhancement. I can imagine a
> system that gets the user to the right function and then helps fill in
> the arguments, all of the time pointing them towards the command line
> rather than away from it. The author of the referenced article
> highlighted some hidden costs of R, but did not highlight the hidden
> benefits (because they were hidden from him). A big benefit of R is
> all of the bugs that aren't in it (which may or may not be due to its
> free status).
>
> Patrick Burns, Burns Statistics, http://www.burns-stat.com
> (home of S Poetry and A Guide for the Unwilling S User)
>
> Jan P. Smit wrote:
>> Dear Philippe, very interesting. The URL of the article is
>> http://www.scientific-computing.com/scwsepoct04free_statistics.html.
>> Best regards, Jan Smit
>>
>> Philippe Grosjean wrote:
>>> Hello, in the latest 'Scientific Computing World' magazine (issue
>>> 78, p. 22), there is a review of free statistical software by Felix
>>> Grant (one doesn't have to pay good money to obtain good statistics
>>> software). As far as I know, this is the first time that R is even
>>> mentioned in this magazine, given that it usually discusses
>>> commercial products. [...]

I really agree with you, Patrick. To me the keys are: having better help search capabilities; linking help files to case studies or at least detailed examples; having a navigator by keywords (a rudimentary one is at http://biostat.mc.vanderbilt.edu/s/finder/finder.html); having a great library of examples keyed by statistical goals (a la the BUGS examples guides); and having a menu-driven skeleton code generator that gives beginners a starting script to edit to use their variable names, etc. Also I think we need a discussion board that has a better memory for new users, like some of the user forums currently on the web, or using a wiki.

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] The hidden costs of GPL software?
Mike Prager wrote:
> At 11/18/2004 07:01 AM Thursday, Thomas Schönhoff wrote:
>> To sum up, what I need is an extensive example-based help system,
>> focused on how to do things in R. In part this already exists, e.g.
>> simpleR by Verzani (contributed docs area), etc. Hopefully I can
>> contribute to this in the future, since it seems to me invaluable to
>> learn R by going through example-based lessons (some are found via
>> vignette()). These are much more comprehensible to me than the short,
>> reference-like entries in the current help system, mostly due to
>> their very technical approach (the same can be said of the official
>> GNU R manuals, especially The R Language, which wasn't a great help
>> for me when I took my first look at GNU R). In this context something
>> like the GuideMaps of ViSta come to mind! But to be as clear as
>> possible: I think GNU R is great and I appreciate all the efforts of
>> the R core team and associates! Nevertheless it seems valuable to
>> re-think the help system in R with respect to those who may have a
>> good understanding of statistics but lack basic experience in how to
>> introduce themselves to the sophisticated world of the R/S languages.
>
> (I posted similar material before, but it was moved to R-devel, and I
> wanted to express a bit of it here.) I have frequently felt, like
> Thomas, that what could make R easier to use is not a GUI, but a help
> system more focused on tasks and examples, rather than on functions
> and packages. This has obvious and large costs of development, and I
> am unlikely to contribute much myself, for reasons of time and
> ability. Yet I mention it for the sake of this discussion. Such a help
> system could be a tree (or key) structure in which, through making
> choices, the user's description of the desired task is gradually
> narrowed. At the end of each twig of the tree would be a list of
> suggested functions for solving the problem, hyperlinked into the
> existing help system (which in many ways is outstanding and has
> evolved just as fast as R itself). This could be coupled with the
> continued expansion of the number of examples in the help system. Now
> I must express appreciation for what exists already that helps in this
> regard: MASS (in its many editions), Introductory Statistics with R,
> simpleR, and the other free documentation that so many authors have
> generously provided. Not to mention the superlative contribution of R
> itself, and the work of the R development team. It is beyond my
> understanding how something so valuable and well thought out has been
> created by people with so many other responsibilities. Mike

I second all of that. What you are describing, Mike, could be done with a community-maintained wiki, with easy-to-add hyperlinks to other sites. Just think what a great value it would be to the statistical community to have an ever-growing set of examples with all code and output, taking a cue from the BUGS examples guides. The content could be broken down by major areas (data import examples, data manipulation examples, many analysis topics, many graphics topics, etc.). Ultimately the more elaborate case studies could be peer-reviewed (a la the Journal of Statistical Software) and updated.

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
neela v wrote:
> Hi all. Can someone clarify this issue for me: feature-wise, which is
> better, R or SAS, leaving aside the commercial aspect? I suppose there
> are a few people who have worked with both R and SAS, and I hope they
> can help me decide. Thank you for the help.

I estimate that SAS is about 11 years behind R in statistical analysis and graphics capabilities, and that gap is growing. Also, code in SAS is not peer-reviewed as is code in R. SAS has advantages in two areas: dealing with massive datasets (generally speaking, larger than 1GB) and getting more accurate P-values in mixed effects models. See http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf for one comparison of SAS and S on technical features.

There are companies whose yearly license fees to SAS total millions of dollars. Then those companies hire armies of SAS programmers to program an archaic macro language using old statistical methods to produce ugly tables and the worst graphics in the statistical software world.

Frank Harrell
SAS User, 1969-1991

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
[R] Error with strwidth after lattice graphic drawn
Under R 2.0.1 (2004-11-15) on platform i386-pc-linux-gnu, I'm getting an error when using strwidth after a lattice graphic is drawn:

library(lattice)
xyplot(runif(20) ~ runif(20))
strwidth('xxx')
Error in strwidth("xxx") : invalid graphics state

Any help appreciated. I have version 2.0.1 of grid and version 0.10-14 of lattice.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
BXC (Bendix Carstensen) wrote:
> Two major advantages of SAS that seem to have been overlooked in the
> previous replies are:
>
> 1) The data step language in SAS for data manipulation is more
> human-readable than R code in general. R is not a definitely
> write-only language like APL, but in data manipulation in particular
> it is easy to write code that is impossible to decipher after a few
> weeks. You can also produce unreadable code in SAS, but it generally
> takes more of an effort. Thus: data manipulation is easier to document
> in SAS than in R.

I agree with the readability of data manipulation code, especially for novice users. As far as functionality for data manipulation is concerned, R has more flexibility and is faster to program once one has experience. I do think that learning data manipulation techniques in R takes a while.

> 2) proc tabulate. This procedure enables you to do extensive, sensible
> tabulation of your data if you are prepared to read the manual. (This
> is not a product of the complexity of the software, but of the
> complexity of the tabulation features.) Compared to this, only
> rudimentary tools exist in R (afaik).

I could not disagree more with that statement. Look for example at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf. With R you can customize tables by specifying your own function for computing any statistic of interest. Let's see how to use PROC TABULATE to display stratified Kaplan-Meier 4-year survival estimates along with mean and median life length, not to mention match the fine control of formatting available using the combination of R and LaTeX.

> So if you want to do well documented data manipulation and clear and
> compact tables, go for SAS.

Readability is different from being well documented. And for clear and compact tables, R is the winner hands down.

Frank Harrell

> If you want to do statistical analyses and graphics (in finite time),
> go for R. Bendix Carstensen, Senior Statistician

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
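For readers wanting to try the tabulation route referenced above, a minimal sketch with invented variable names; summary.formula's method='reverse' stratifies each right-hand variable by treatment group, and latex() renders the result:

library(Hmisc)

s <- summary(treatment ~ age + sex + cholesterol,
             data=d, method='reverse')  # continuous -> quartiles,
print(s)                                # categorical -> percents
latex(s)                                # finely formatted LaTeX table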
Re: [R] Help with ooplot(gplots) and error bars
Jean-Louis Abitbol wrote:
> Dear All, I am trying to graph a proportion and its 95% CI by a factor
> with ooplot (any other better solution?). It works well until I try to
> add the confidence interval. This is the error message and a
> description of the data:
>
> dat1
>       PointEst
> TT1 1      3.6
> TT2 2      5.0
> TT3 3      5.8
> TT4 4     11.5
> TT5 5      7.5
> TT5 6      8.7
> TT7 7     17.4
>
> dat2
>       Lower
> TT1 1   1.0
> TT2 2   2.2
> TT3 3   2.7
> TT4 4   6.7
> TT5 5   3.9
> TT5 6   4.6
> TT7 7  11.5
>
> dat3
>       Upper
> TT1 1  12.3
> TT2 2  11.2
> TT3 3  12.1
> TT4 4  19.1
> TT5 5  14.2
> TT5 6  15.6
> TT7 7  25.6
>
> ooplot(dat1, type="barplot", col=rich.colors(7, "temperature"),
>        names.arg=c("X","Y","Z","A","B","C","D"), plot.ci=TRUE,
>        ci.l=dat2, ci.u=dat3, xlab="Treatment",
>        ylab="Percent Normalized Patients")
> Error in ooplot.default(dat1, type = "barplot", col = rich.colors(7,
>   "temperature"), : 'height' and 'ci.u' must have the same dimensions.
>
> I have tried various ways of supplying ci.l and ci.u (including a
> vector). Thanks for the help that anyone can bring. Regards, JL

One way is to look at the examples for Dotplot in the Hmisc package. Those examples display bootstrap percentile confidence intervals for a mean.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Help with ooplot(gplots) and error bars
Jean-Louis Abitbol wrote:
> Dear Pr Harrell, thanks for your help on this occasion as well as with
> previous questions to the list. I just looked at the example in your
> intro doc. However, I am dealing with proportions (i.e., % of patients
> responding to a given treatment). In this case I am not sure I can use
> summarize and then xYplot. I am not aware of R graphing tools that can
> deal directly with proportions, adding CIs, not to mention producing
> by-factor/trellis plots. This is why I was trying to do it by hand
> (using binconf) with ooplot, without much success I am afraid. Best
> regards, JL Abitbol, MD

Jean-Louis,

Here is an example.

# Plot proportions and their Wilson confidence limits
set.seed(3)
d <- expand.grid(continent=c('USA','Europe'), year=1999:2001, reps=1:100)
# Generate binary events from a population probability of 0.2
# of the event, same for all years and continents
d$y <- ifelse(runif(6*100) <= .2, 1, 0)
s <- with(d,
  summarize(y, llist(continent, year),
            function(y) {
              n <- sum(!is.na(y))
              s <- sum(y, na.rm=TRUE)
              binconf(s, n)
            }, type='matrix'))
Dotplot(year ~ Cbind(y) | continent, data=s,
        ylab='Year', xlab='Probability')

I did have to temporarily override a function in Hmisc to fix a problem. This will be corrected in an upcoming release of Hmisc:

mApply <- function(X, INDEX, FUN=NULL, ..., simplify=TRUE) {
  ## Matrix tapply
  ## X: matrix with n rows; INDEX: vector or list of vectors of length n
  ## FUN: function to operate on submatrices of X by INDEX
  ## ...: arguments to FUN; simplify: see sapply
  ## Modification of code by Tony Plate [EMAIL PROTECTED] 10Oct02
  ## If FUN returns more than one number, mApply returns a matrix with
  ## rows corresponding to unique values of INDEX
  nr <- nrow(X)
  if(!length(nr)) {  ## X not a matrix
    r <- tapply(X, INDEX, FUN, ..., simplify=simplify)
    if(is.matrix(r)) r <- drop(t(r))
    else if(simplify && is.list(r))
      r <- drop(matrix(unlist(r), nrow=length(r),
                       dimnames=list(names(r), names(r[[1]])),
                       byrow=TRUE))
  } else {
    idx.list <- tapply(1:nr, INDEX, c)
    r <- sapply(idx.list,
                function(idx, x, fun, ...) fun(x[idx, , drop=FALSE], ...),
                x=X, fun=FUN, ..., simplify=simplify)
    if(simplify) r <- drop(t(r))
  }
  dn <- dimnames(r)
  if(length(dn) && !length(dn[[length(dn)]])) {
    fx <- FUN(X, ...)
    dnl <- if(length(names(fx))) names(fx) else dimnames(fx)[[2]]
    dn[[length(dn)]] <- dnl
    dimnames(r) <- dn
  }
  if(simplify && is.list(r) && is.array(r)) {
    ll <- sapply(r, length)
    maxl <- max(ll)
    empty <- (1:length(ll))[ll == 0]
    for(i in empty) r[[i]] <- rep(NA, maxl)
    ## unlist does not keep place for NULL entries for nonexistent categories
    first.not.empty <- ((1:length(ll))[ll > 0])[1]
    nam <- names(r[[first.not.empty]])
    dr <- dim(r)
    r <- aperm(array(unlist(r), dim=c(maxl, dr),
                     dimnames=c(list(nam), dimnames(r))),
               c(1 + seq(length(dr)), 1))
  }
  r
}

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Error with strwidth after lattice graphic drawn
Uwe Ligges wrote:
> Prof Brian Ripley wrote:
>> On Sat, 20 Nov 2004, Frank E Harrell Jr wrote:
>>> I'm getting an error when using strwidth after a lattice graphic is
>>> drawn: [...] Error in strwidth("xxx") : invalid graphics state
>>
>> The advice is `don't do that'! strwidth() is a base graphics command,
>> and will only work if a device is currently plotting base graphics.
>> Lattice is built on grid, which has stringWidth().
>
> ... and convertWidth() is useful to display stuff afterwards in an
> interpretable way ...
>
> Uwe Ligges

Thanks Uwe and Brian. Before the latest versions, grid would let me use ordinary graphics functions to deal with the currently rendered plot after I did par(usr=c(0,1,0,1)), so I was lazy. I changed my code to use the specific grid functions.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
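For reference, a sketch of the grid equivalent of the base call in question:

library(grid)

# width of the string as a fraction of the current viewport
convertWidth(stringWidth('xxx'), 'npc', valueOnly=TRUE)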
Re: [R] SAS or R software
Michael Friendly wrote:
> I can't resist dipping my oar in here. For me, some significant
> advantages of SAS are:
>
> - Ability to input data in almost *any* conceivable form using the
> combination of features available through input/infile statements,
> SAS informats and formats, data step programming, etc. Dataset
> manipulation (merge, join, stack, subset, summarize, etc.) also favors
> SAS in more complex cases, IMHO.

I respectfully disagree, Michael, in cases where the file sizes are not enormous. And in many cases repetitive SAS data manipulation can be done much more easily using vectors (e.g., vectors of recodes) and matrices in R, as shown in the examples in Alzola & Harrell (http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf). We are working on a project in which a client sends us 50 SAS datasets. Not only can we do all the complex data manipulations needed in R, but we can treat the 50 datasets as an array (actually a list()) to handle repetitive operations over many datasets.

> - Output delivery system (ODS): *every* piece of SAS output is an
> output object that can be captured as a dataset, rendered in RTF,
> LaTeX, HTML, PDF, etc. with a relatively simple mechanism (including
> graphs):
>
>   ods pdf file='mystuff.pdf';
>   ... any sas stuff ...
>   ods pdf close;
>
> - If you think the output tables are ugly, it is not difficult to
> define a template for *any* output to display it the way you like.

I would like to see SAS ODS duplicate Table 10 in http://biostat.mc.vanderbilt.edu/twiki/pub/Main/StatReport/summary.pdf.

Cheers,
Frank

> - ODS Graphics (new in SAS 9.1) will extend much of this so that
> statistical procedures will produce many graphics themselves with
> ease.
>
> One significant disadvantage for me is the difficulty of composing
> multipanel graphic displays (trellis graphics, linked micromaps, etc.)
> due to the lack of an overall, top-down graphics environment. As well,
> there are a variety of kinds of graphics I've found extraordinarily
> frustrating to try to do in SAS because of a lack of coherence or
> generality in the output available from procedures -- an example would
> be effect displays, such as implemented in R in the effects package.
> I can't agree, however, with Frank Harrell that SAS produces 'the
> worst graphics in the statistical software world.' One can get ugly
> graphs in R almost as easily as in SAS just by accepting the 80-20
> rule: you can typically get 80% of what you want with 20% of the
> effort; to get what you really want takes the remaining 80% of effort.
> On the other hand, the active hard work of many R developers has led
> to R graphics for which the *default* results for many graphs avoid
> many of the egregious design errors introduced in SAS in the days of
> pen-plotters (+ signs for points, cross-hatching for area fill).
>
> -Michael

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Weibull survival regression
Eric Lim wrote:
> Dear R users, please can you help me with a relatively straightforward
> problem that I am struggling with? I am simply trying to plot a
> baseline survivor and hazard function for a simple dataset of lung
> cancer survival, where `futime' is follow-up time in months and status
> is 1=dead and 0=alive.
>
> Using the survival package:
>
> lung.wbs <- survreg(Surv(futime, status) ~ 1, data=lung,
>                     dist='weibull')
> plot(lung.wbs)
>
> returns the error message:
>
> Error in xy.coords(x, y, xlabel, ylabel, log) :
>   x and y lengths differ
>
> Using the Design package:
>
> lung.wbd <- psm(Surv(futime, status) ~ 1, dist='weibull', data=lung,
>                 na.action=na.omit)

You don't need to specify na.action here.

> survplot(lung.wbd)
>
> returns the error message:
>
> Error in survplot.Design(lung.wbd) :
>   fit does not have design information

survplot only works when there is at least one covariate. Sorry. Maybe someday ...

-Frank Harrell

> Regards, Eric Lim, Papworth Hospital, Cambridge, UK

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
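In the meantime, the intercept-only Weibull survivor and hazard curves can be drawn by hand from the survreg fit; a sketch using the poster's variable names and survreg's parameterization (Weibull scale = exp(intercept), shape = 1/fit$scale):

library(survival)

fit <- survreg(Surv(futime, status) ~ 1, data=lung, dist='weibull')
b <- exp(coef(fit))   # Weibull scale parameter
a <- 1/fit$scale      # Weibull shape parameter
curve(exp(-(x/b)^a), from=0, to=max(lung$futime),
      xlab='Months', ylab='S(t)')              # survivor function
curve((a/b)*(x/b)^(a-1), from=1, to=max(lung$futime),
      xlab='Months', ylab='h(t)')              # hazard function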
Re: [R] number of pairwise present data in matrix with missings
Andreas Wolf wrote:
> Is there a smart way of determining the number of pairwise present
> data points in a data matrix with missings (maybe as a by-product of
> some statistical function)? So far, I used several loops like:
>
> for (column1 in 1:99) {
>   for (column2 in 2:100) {
>     for (row in 1:500) {
>       if (!is.na(matrix[row, column1]) && !is.na(matrix[row, column2]))
>         pairs[column1, column2] <- pairs[column1, column2] + 1
>     }
>   }
> }
>
> but this seems neither the most elegant nor an utterly fast solution.
> Thanks for suggestions. Andreas Wolf

library(Hmisc)
n <- naclus(mydataframe)
plot(n)    # show pairwise missingness in a dendrogram
naplot(n)  # show more details in multiple plots

Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
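For the counts themselves, a vectorized sketch that replaces the triple loop: the (i, j) entry of the cross-product of the not-missing indicator matrix is exactly the number of rows in which both columns i and j are observed.

present <- !is.na(x)           # logical matrix, TRUE where a value is observed
pairs   <- crossprod(present)  # t(present) %*% present: pairwise-complete counts

This runs in a fraction of a second even for a 500 x 100 matrix.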
Re: [R] R vs SPSS
Vito Ricci wrote:
> Dear all, in recent weeks you discussed R vs. SAS. I want to ask your
> opinion about a comparison between R and SPSS. I don't know this
> software, but some weeks ago I went to a presentation of the product.
> I found it really user-friendly, with a GUI (even if I'd prefer a
> command line), and very useful and simple to use for creating and
> managing tables, OLAP techniques, and pivot tables. What do you think?
> Cordially, Vito

What worries me about SPSS is that it often results in poor statistical practice. The defaults in dialog boxes are not very good in some cases, and like SAS, SPSS tends to lead users to make too many assumptions (linearity in regression being one of the key ones).

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] problem with using transace
[EMAIL PROTECTED] wrote:
> I am trying to use the Hmisc function transace to transform
> predictors:
>
> test <- cbind(flowstress, pressres, alloy)
> xtrans <- transace(x, binary='pressres', monotonic='flowstress',
>                    categorical='alloy')
>
> and I am getting the following message:
>
> Error in ace(x[, -i], x[, i], monotone = im, categorical = ic) :
>   unused argument(s) (monotone ...)
>
> Any idea? Thanks for your help, Anne

The corrected version (below) will fix that problem, but note that there is a bug in ace causing it not to allow a monotonicity constraint when a variable is on the left-hand side. This is inconsistent with the ace documentation. There are other problems in ace in which it checks column numbers against the number of rows in the x matrix instead of the number of columns. The internal version of ace defined inside areg.boot fixes the latter problem. Note that I reported these problems a long time ago.

Frank

transace <- function(x, monotonic=NULL, categorical=NULL, binary=NULL,
                     pl=TRUE) {
  if(.R.) require('acepack')  # provides ace, avas
  nam <- dimnames(x)[[2]]
  omit <- is.na(x %*% rep(1, ncol(x)))
  omitted <- (1:nrow(x))[omit]
  if(length(omitted)) x <- x[!omit, ]
  p <- ncol(x)
  xt <- x  # binary variables retain original coding
  if(!length(nam)) stop('x must have column names')
  rsq <- rep(NA, p)
  names(rsq) <- nam
  for(i in (1:p)[!(nam %in% binary)]) {
    lab <- nam[-i]
    w <- 1:(p-1)
    im <- w[lab %in% monotonic]
    ic <- w[lab %in% categorical]
    if(nam[i] %in% monotonic)   im <- c(0, im)
    if(nam[i] %in% categorical) ic <- c(0, ic)
    m <- 10*(length(im) > 0) + (length(ic) > 0)
    if(m == 11)      a <- ace(x[,-i], x[,i], mon=im, cat=ic)
    else if(m == 10) a <- ace(x[,-i], x[,i], mon=im)
    else if(m == 1)  a <- ace(x[,-i], x[,i], cat=ic)
    else             a <- ace(x[,-i], x[,i])
    xt[,i] <- a$ty
    rsq[i] <- a$rsq
    if(pl) plot(x[,i], xt[,i], xlab=nam[i],
                ylab=paste('Transformed', nam[i]))
  }
  cat('R-squared achieved in predicting each variable:\n\n')
  print(rsq)
  attr(xt, 'rsq') <- rsq
  attr(xt, 'omitted') <- omitted
  invisible(xt)
}

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] impute missing values in correlated variables: transcan?
Jonathan Baron wrote:
> I would like to impute missing data in a set of correlated variables
> (columns of a matrix). It looks like transcan() from Hmisc is roughly
> what I want. It says, "transcan automatically transforms continuous
> and categorical variables to have maximum correlation with the best
> linear combination of the other variables" and, "by default, transcan
> imputes NAs with best guess expected values of transformed variables,
> back transformed to the original scale." But I can't get it to work.
> I say:
>
> m1 <- matrix(1:20 + rnorm(20), 5, )  # four correlated variables
> colnames(m1) <- paste("R", 1:4, sep="")
> m1[c(2,19)] <- NA                    # simulate some missing data
> library(Hmisc)
> transcan(m1, data=m1)
>
> and I get:
>
> Error in rcspline.eval(y, nk = nk, inclx = TRUE) :
>   fewer than 6 non-missing observations with knots omitted

Jonathan - you would need many more observations to be able to fit flexible additive models as transcan does. Also note that single imputation has problems, and you may want to consider multiple imputation as done by the Hmisc aregImpute function, if you had more data.

Frank

> I've tried a few other things, but I think it is time to ask for help.
> The specific problem is a real one. Our graduate admissions committee
> (4 members) rates applications, and we average the ratings to get an
> overall rating for each applicant. Sometimes one of the committee
> members is absent, or late; hence the missing data. The members differ
> in the way they use the rating scale, in both slope and intercept (if
> you regress each on the mean). Many decisions end up depending on the
> second decimal place of the averages, so we want to do better than
> just averaging the non-missing ratings. Maybe I'm just not seeing
> something really simple. In fact, the problem is simpler than transcan
> assumes, since we are willing to assume linearity of the regression of
> each variable on the other variables. Other members proposed solutions
> that assumed this, but they did not take into account the fact that
> missing data at the high or low end of each variable (each member's
> ratings) would change its mean. Jon

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
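Given the stated willingness to assume linearity, a bare-bones regression imputation for one rater might look like the sketch below (column names match the toy example; it needs more than the 5 toy rows to fit, and single imputation still understates uncertainty, as noted above):

d    <- as.data.frame(m1)                        # ratings R1-R4, NAs in R4
miss <- is.na(d$R4)
fit  <- lm(R4 ~ R1 + R2 + R3, data=d, subset=!miss)
d$R4[miss] <- predict(fit, newdata=d[miss, ])    # fill in predicted ratings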
[R] [R-pkgs] New Hmisc Package Available
An updated version of Hmisc is now available from CRAN. The Web site for Hmisc is http://biostat.mc.vanderbilt.edu/s/Hmisc. The change log may be found at http://biostat.mc.vanderbilt.edu/changelog/Hmisc.html. Changes made after 2004-11-24 should be ignored; these will be in the next version.

The most major change in Hmisc is that, thanks to discussions with a highly respected, persistent, and convincing member of the R community, Hmisc no longer overrides [.factor to drop unused factor levels by default upon subsetting. You can get the old behavior at any time by typing dropUnusedLevels(). A message to that effect appears when you attach the package. Also, Hmisc no longer overrides the interaction function, as the R built-in version provides the sep argument. In the next version, the summary.formula function will be updated to automatically drop unused levels when this is appropriate.

Some other key changes are the addition of bubble plot capability in xYplot and the addition of the capability of csv.get to correct certain date formatting inconsistencies. sasxport.get now has keep and drop arguments to restrict attention to a subset of datasets when many datasets are being imported. Other changes include bug and documentation fixes.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
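To see the behavior change in miniature (this is standard base R factor semantics, which Hmisc now leaves alone):

f <- factor(c('a', 'b', 'c'))
levels(f[1:2])           # "a" "b" "c" -- subsetting keeps unused levels
levels(factor(f[1:2]))   # "a" "b"     -- re-factoring drops them
dropUnusedLevels()       # restores the old Hmisc subsetting behavior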
Re: [R] What is the most useful way to detect nonlinearity in logistic regression
(Ted Harding) wrote:
> On 05-Dec-04 Patrick Foley wrote:
>> It is easy to spot response nonlinearity in normal linear models
>> using plot(something.lm). However, plot(something.glm) produces
>> artifactual peculiarities, since the diagnostic residuals are
>> constrained by the fact that y can only take values 0 or 1. What do R
>> users find most useful in checking the linearity assumption of
>> logistic regression (i.e. log-odds = a + bx)?
>> Patrick Foley [EMAIL PROTECTED]
>
> The most useful way to detect nonlinearity in logistic regression is:
> a) have an awful lot of data;
> b) have the x (covariate) values judiciously placed.
>
> Don't be optimistic about this problem. The amount of information,
> especially about non-linearity, in binary responses is often a lot
> less than people intuitively expect. This is an area where R can be
> especially useful for self-education, by exploring possibilities and
> simulation. For example, define the function (for quadratic
> nonlinearity):
>
> testlin2 <- function(a, b, N) {
>   x  <- c(-1.0, -0.5, 0.0, 0.5, 1.0)
>   lp <- a*x + b*x^2
>   p  <- exp(lp)/(1 + exp(lp))
>   n  <- N*c(1,1,1,1,1)
>   r  <- c(rbinom(1, n[1], p[1]), rbinom(1, n[2], p[2]),
>           rbinom(1, n[3], p[3]), rbinom(1, n[4], p[4]),
>           rbinom(1, n[5], p[5]))
>   resp <- cbind(r, n - r)
>   X <- cbind(x, x^2); colnames(X) <- c("x", "x2")
>   summary(glm(formula = resp ~ X - 1, family = binomial),
>           correlation = TRUE)
> }
>
> This places N observations at each of (-1.0, -0.5, 0.0, 0.5, 1.0),
> generates the N binary responses with probability p(x) where
> log(p/(1-p)) = a*x + b*x^2, fits a logistic regression forcing the
> intercept term to be 0 (so that you're not diluting the info by
> estimating a parameter you know to be 0), and returns the
> summary(glm(...)) from which the p-values can be extracted. The
> p-value for x^2 is testlin2(a,b,N)$coefficients[2,4].
>
> You can run this function as a one-off for various values of a, b, N
> to get a feel for what happens. You can run a simulation on the lines
> of:
>
> pvals <- numeric(1000)
> for(i in (1:1000)) {
>   pvals[i] <- testlin2(1, 0.1, 500)$coefficients[2,4]
> }
>
> so that you can test how often you get a significant result. For
> example, adopting the ritual "significant == P < 0.05, power = 80%",
> you can see a histogram of the p-values over the conventional
> significance breaks with
>
> hist(pvals, breaks=c(0,0.01,0.03,0.1,0.5,0.9,0.95,0.99,1), freq=TRUE)
>
> and you can see your probability of getting a significant result as,
> e.g., sum(pvals < 0.05)/1000.
>
> I found that, with testlin2(1, 0.1, N), i.e. a = 1.0, b = 0.1,
> corresponding to log(p/(1-p)) = x + 0.1*x^2 (a possibly interesting
> degree of nonlinearity), I had to go up to N=2000 before I was getting
> more than 80% of the p-values < 0.05. That corresponds to 2000
> observations at each value of x, or 10,000 observations in all.
> Compare this with a similar test for non-linearity with
> normally-distributed responses [exercise for the reader].
>
> You can write functions similar to testlin2 for higher-order
> nonlinearities, e.g. testlin3 for a*x + b*x^3, testlin23 for
> a*x + b*x^2 + c*x^3, etc. (the modifications required are obvious),
> and see how you get on. As I say, don't be optimistic! In particular,
> run testlin3 a few times and see the sort of mess that can come out --
> in particular gruesome correlations, which is why correlation=TRUE is
> set in the call to summary(glm(...), correlation=TRUE).
>
> Best wishes, Ted.
library(Design)  # also requires Hmisc
f <- lrm(sick ~ sex + rcs(age,4) + rcs(blood.pressure,5))
# Restricted cubic spline in age with 4 knots, blood.pressure with 5
anova(f)    # automatic tests of linearity of all predictors
latex(f)    # see algebraic form of fit
summary(f)  # get odds ratios for meaningful changes in predictors

But beware of using the tests of linearity. If non-significant results cause you to reduce an effect to linear, confidence levels and type I errors are no longer preserved. I use tests of linearity mainly to demonstrate that effects are more often nonlinear than linear, given sufficient sample size. I.e., good analysts are needed. I usually leave non-significant nonlinearities in the model.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] Gam() function in R
Yves Magliulo wrote:
> So the mgcv package is the one I need! Indeed, I want integrated
> smoothness selection and smooth interactions rather than stepwise
> selection. I have a lot of predictors, and I use gam to select those
> that are useful and exclude the others (using p-values).

It is interesting that you use P-values but do not care that the strategy you use (variable selection, as opposed to pre-specifying models or just using shrinkage) does not preserve type I error or confidence interval coverage probabilities in subsequent analyses with mgcv.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
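A sketch of the shrinkage alternative within mgcv itself (the select argument exists in later mgcv versions; variable names are hypothetical): fit the full model once and let the extra penalty shrink uninformative terms, instead of excluding them by p-value.

library(mgcv)

f <- gam(y ~ s(x1) + s(x2) + s(x3), data=d, select=TRUE)
summary(f)   # the added penalty can shrink entire smooth terms toward zero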
Re: [R] cbind() and factors.
Gabor Grothendieck wrote:
> michael watson (IAH-C) <michael.watson at bbsrc.ac.uk> writes:
>
>> Hi, I'm seeing some odd behaviour with cbind(). My code is:
>>
>> cat <- read.table("cogs_category.txt", sep="\t", header=TRUE,
>>                   quote=NULL, colClasses="character")
>> colnames(cat)
>> [1] "Code"        "Description"
>> is.factor(cat$Code)
>> [1] FALSE
>> is.factor(cat$Description)
>> [1] FALSE
>> is.factor(rainbow(nrow(cat)))
>> [1] FALSE
>> cat <- cbind(cat, Color=rainbow(nrow(cat)))
>> is.factor(cat$Color)
>> [1] TRUE
>> ?cbind
>>
>> I read in a text file which has two columns, Code and Description.
>> Neither of these is a factor. I want to add a column of colours to
>> the data frame using rainbow(). The rainbow function also does not
>> return a factor. However, if I cbind my data frame (which has no
>> factors in it) and the result of rainbow() (which is a vector, not a
>> factor), then for some reason the new column is a factor...??
>
> Others have already explained the problem and given what is likely the
> best solution, but here is one other idea, just in case. You may
> require a data frame depending on what you want to do, but if you
> don't, then you could alternately use a character matrix, since that
> won't result in any conversions to factor. Let's call the data frame
> from read.table Cat.df, and our matrix Cat.m. ("cat" is not wrong, but
> it's confusing, since there is a common R function called cat.) Now we
> can write the following and don't have to worry about factors:
>
> Cat.df <- read.table(...)
> # create a character matrix and cbind Colors to it
> Cat.m <- cbind(as.matrix(Cat.df), Color = rainbow(nrow(Cat.df)))
>
> If you do find you need a data frame later, you can convert it back
> like this:
>
> Cat.df <- as.data.frame(Cat.m)
> Cat.df[] <- Cat.m  # clobber factors with character data

For speed, the mApply function in the Hmisc package (used by the Hmisc summarize function) does looping for stratified statistical summaries by operating on matrices rather than data frames. Factors are converted to numerics, and service routines can save and restore the levels and other attributes. Here is an example from the summarize help file, plus related examples:

# To run mApply on a data frame:
m <- mApply(asNumericMatrix(x), race, h)
# Here assume h is a function that returns a matrix similar to x
at <- subsAttr(x)       # get original attributes and storage modes
matrix2dataFrame(m, at)

# Get stratified weighted means
g <- function(y) wtd.mean(y[,1], y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)

# Compare speed of mApply vs. by
d <- data.frame(sex=sample(c('female','male'), 10, TRUE),
                country=sample(letters, 10, TRUE),
                y1=runif(10), y2=runif(10))
g <- function(x) {
  y <- c(median(x[,'y1'] - x[,'y2']),
         med.sum=median(x[,'y1'] + x[,'y2']))
  names(y) <- c('med.diff', 'med.sum')
  y
}

system.time(by(d, llist(sex=d$sex, country=d$country), g))
system.time({
  x <- asNumericMatrix(d)
  a <- subsAttr(d)
  m <- mApply(x, llist(sex=d$sex, country=d$country), g)
})
system.time({
  x <- asNumericMatrix(d)
  summarize(x, llist(sex=d$sex, country=d$country), g)
})

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
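One more variant worth noting as a sketch: assigning the new column directly avoids the cbind dispatch entirely, so no factor coercion happens (same file as above):

Cat.df <- read.table('cogs_category.txt', sep='\t', header=TRUE,
                     quote=NULL, colClasses='character')
Cat.df$Color <- rainbow(nrow(Cat.df))  # plain character column, no cbind
is.factor(Cat.df$Color)                # FALSE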
Re: [R] how to fit a weighted logistic regression?
Kerry Bush wrote:
> I tried lrm in library(Design), but there is always some error
> message. Is this function really doing weighted logistic regression,
> maximizing the following likelihood:
>
>   sum_i w_i * (y_i * x_i'beta - log(1 + exp(x_i'beta)))
>
> Does anybody know a better way to fit this kind of model in R? FYI,
> one example of getting an error message is:
>
> x <- runif(10, 0, 3)
> y <- c(rep(0,5), rep(1,5))
> w <- rep(1/10, 10)
> fit <- lrm(y ~ x, weights=w)
> Warning message:
> currently weights are ignored in model validation and bootstrapping
>   lrm fits in: lrm(y ~ x, weights = w)
>
> Although the model can be fit, the above warning makes me
> uncomfortable. Can anybody explain it a little bit?

The message means exactly what it says. Model validation in Design currently cannot incorporate weights for lrm. Everything else is OK.

Frank Harrell

> Best wishes, Feixia

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
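As a sketch for the likelihood in the question itself: glm maximizes exactly that weighted binomial log-likelihood when given prior weights, and quasibinomial gives the same point estimates while suppressing the non-integer-count warnings (using the poster's toy data):

x <- runif(10, 0, 3)
y <- c(rep(0, 5), rep(1, 5))
w <- rep(1/10, 10)
fit <- glm(y ~ x, family=quasibinomial, weights=w)  # weighted logistic fit
coef(fit)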
Re: [R] MIME decoding in Mozilla Thunderbird
CG Pettersson wrote:
> Hello all! I recently switched mail programs to Mozilla Thunderbird,
> running on W2k. Everything works fine except the MIME digests from
> this list: the decoding doesn't work properly. I had contact with
> Martin Maechler some time ago, and he suggested a try on this list for
> ideas on how to do the decoding in Windows, even if it's not a proper
> R question. Outlook does the proper job on the MIME, but using that is
> to go a bit far! Or? /CG

Thunderbird, which is an otherwise wonderful mail client, does not work for r-help digests. The reason is that if you receive 100 messages in a day, Thunderbird inefficiently handles all the MIME 'attachments', and navigating them all is incredibly slow. I didn't have the decoding problem you mentioned, though.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] AUC for logistic regression [was: (no subject)]
Joe Nocera wrote:
> I believe that Roman is referring to AUC as the Area Under the Curve
> from a Receiver Operating Characteristic. If this is indeed your
> quantity of interest, it can be calculated in R. You can download code
> at:
>
> http://www.bioconductor.org/repository/release1.5/package/Win32/
> and/or
> http://biostat.ku.dk/~bxc/SPE/library/
>
> Check out the archives - I'm sure there is more there if you search
> "ROC" instead. Cheers, Joe
>
> Quoting Spencer Graves [EMAIL PROTECTED]:
>> What's AUC? If you mean AIC (Akaike Information Criterion), and if
>> you fit logistic regression using glm, the help file says that glm
>> returns an object of class "glm", which is a list containing among
>> other things a component "aic". For example, suppose you fit a model
>> as follows:
>>
>> fit <- glm(y ~ x, family=binomial(), ...)
>>
>> Then fit$aic returns the AIC. You may also wish to consider anova and
>> anova.glm. Hope this helps. Spencer Graves
>>
>>> [EMAIL PROTECTED] wrote:
>>> Dear R-helper, I would like to compare the AUC of two logistic
>>> regression models (same population). Is it possible with R?
>>> Thank you. Roman Rouzier

AUC is standard output of the lrm function in the Design package (the "C" index). validate.lrm computes the overfitting-corrected C index.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
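A sketch of both routes with invented names: lrm reports the concordance directly, and somers2 in Hmisc computes it from any vector of predicted probabilities.

library(Design)   # loads Hmisc too

f <- lrm(y ~ x1 + x2, data=d)
f$stats['C']                  # AUC (C index) for this model

phat <- plogis(predict(f))    # predicted probabilities
somers2(phat, d$y)            # C, Dxy, n from raw predictions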
Re: [R] SAS or R software
Alexander C Cambon wrote:
> I apologize for adding this so late to the "SAS or R software" thread.
> This is a question, not a reply, but it seems to me to fit in well
> with the subject of this thread. I would like to know anyone's
> experiences in the following two areas (I should add I have no
> experience myself in these areas):
>
> 1) Migrating from SAS to R in the choice of statistical software used
> for FDA reporting. (For example, was there more effort involved in
> areas of documentation, revision tracking, or validation of software
> code?)

FDA has no requirements. They accept Minitab and even accept Excel. The requirement is to be a good statistician doing quality, reproducible work for its own sake.

> 2) Migrating from SAS to R in the choice of statistical software used
> for NIH reporting (or other US or non-US government agencies).

No issues.

Frank

> I find myself using R more and more and being continually amazed by
> its breadth of capabilities, though I have not tried ordering pizza
> yet. I use SAS, S-Plus, and, more recently, R for survival analysis
> and recurrent events in clinical trials.
>
> Alex Cambon, Biostatistician, School of Public Health and Information
> Sciences, University of Louisville

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
[...] imported against the number reported by PROC CONTENTS, so this problem is easily detected and corrected with emacs. Note that with literally billions of dollars at their disposal, SAS didn't take the time to really write a procedure for PROC EXPORT. Like the R sas.get function, it generates voluminous SAS DATA step code to do the work.

Regarding CDISC, the SAS transport format that is now accepted by FDA is deficient because there is no place for certain metadata (e.g., units of measurement; value labels are remote from the datasets; variable names are truncated to 8 characters). The preferred format for CDISC will become XML.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
Henric Nilsson wrote:
> Marc Schwartz said the following on 2004-12-18 01:19:
>> As you are likely aware, other statistically relevant issues are
>> contained in various ICH guidance documents regarding GCP
>> considerations and principles for clinical trials:
>> http://www.ich.org/[EMAIL PROTECTED]@_TEMPLATE=272
>
> ICH E9 states that (p. 27): "The computer software used for data
> management and statistical analysis should be reliable, and
> documentation of appropriate software testing procedures should be
> available."
>
> Some commercial software vendors (SAS, Insightful, and StatSoft) offer
> white papers stating that their software can work within a 21 CFR Part
> 11 compliant system:
>
> http://www.sas.com/industry/pharma/develop/papers.html
> http://www.insightful.com/industry/pharm/21cfr_part11_Final.pdf
> http://www.statsoft.com/support/whitepapers/pdf/STATISTICA_CFR.pdf
>
> Some commercial vendors (SAS and Insightful) also offer tools for
> validation of the installation and operation of the software. SAS has
> http://support.sas.com/documentation/installcenter/common/91/ts1m3/qualification_tools_guide.pdf
> and S-PLUS has validate().
>
> As a statistical consultant working within the pharmaceutical
> industry, I think that our clients find the white papers to be some
> kind of quality seal. It signals that someone has actually thought
> about the issues involved, written a document about it, and even
> stated that it can be done. Of course, there's a lot of FUD going on
> here. But if our lives can be made simpler by producing similar white
> papers and QA tools, why not? (But for some people, only SAS will do:
> last week we were audited on behalf of a client. One of the specific
> issues discussed was validation and the Part 11 compliance of S-PLUS.
> In this specific trial, data are to be transferred from Oracle
> Clinical -> SAS -> S-PLUS, and the auditors were really worried about
> the first and last links of that chain. Finally, they suggested using
> only SAS... And in this particular case, Part 11 is really a non-issue
> since physical records exist (i.e. case report forms) and all final
> S-PLUS output and code will also be stored physically (i.e.
> print-outs) -- no need for electronic signatures here!)
>
> There is also a general guidance document for computer systems used in
> clinical trials here:
> http://www.fda.gov/ora/compliance_ref/bimo/ffinalcct.htm
> though it is to be superseded by a draft document here:
> http://www.fda.gov/cder/guidance/6032dft.htm
>
> From the introduction (p. 2): "This document provides guidance about
> computerized systems that are used to create, modify, maintain,
> archive, retrieve, or transmit clinical data required to be maintained
> and/or submitted to the Food and Drug Administration (FDA)." The
> `retrieve' part is certainly applicable. ... Henric

That is not clear. And since FDA allows submissions using Excel, with not even an audit trail, and with known major statistical computing errors in Excel, I am fairly certain that it is not applicable, or at the least is not enforced in any meaningful way.

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
Henric Nilsson wrote:
> Frank E Harrell Jr said the following on 2004-12-18 15:03:
>> That is not clear.
>
> Perhaps. And I think this is the issue. From the clients' perspective,
> not a single FDA document states that you can use software other than
> SAS. They haven't really thought about the fact that there aren't any
> FDA documents encouraging the use of SAS for statistical analyses
> either.

Right. This reminds me of the worst movie of all time, Plan 9 From Outer Space, in which the psychic Creskin closes the movie by saying "Can you prove that this DIDN'T happen?".

> I don't think that the real problem is convincing regulatory
> authorities that R (or any other (open-source) software, for that
> matter) is operating adequately. But clients and auditors seem to
> reason along the lines of rather being safe than sorry, and "nobody's
> ever been criticized for using SAS." From their perspective, when we
> propose using `some other' software, they start thinking that it may
> jeopardize their trial results (and, all too often, "but doesn't FDA
> require SAS?").

Yes, that is the hurdle.

> How to fight this? I don't know. Right now I'm thinking "if you can't
> beat 'em, join 'em," and that the way of proving that `some other'
> software works is through having similar documents and tools as the
> commercial vendors.

With the job market for statisticians being excellent, I've often wondered why clinical statisticians in industry are so often timid. Statisticians need to show strength and stamina, along with good teaching skills, on this issue.

>> And since FDA allows submissions using Excel, with not even an audit
>> trail, and with known major statistical computing errors in Excel, I
>> am fairly certain that it is not applicable or at the least is not
>> enforced in any meaningful way.
>
> The general preconception seems to be that neither SAS nor Excel needs
> validation. E.g. the British guideline referenced in my previous email
> states on p. 12 that "It is generally considered that there is no
> requirement for validation of commercial hardware and established
> operating systems or for packages such as the SAS system, Oracle and
> MS Excel, as entities in their own right. However, most are
> configurable systems and so need adequate control of installation and
> their configuration parameters."

This makes me wonder about the British system. Have they not seen the serious calculation errors documented to be in Excel?

> Luckily for Excel, not a single word about precision and adequacy...

Right. Thanks for your note, Henric.

-Frank

-- Frank E Harrell Jr, Professor and Chair, Department of Biostatistics, School of Medicine, Vanderbilt University
Re: [R] SAS or R software
(Ted Harding) wrote: On 19-Dec-04 Tim Churches wrote: Shawn Way wrote: I've seen multiple comments about MS Excel's precision and accuracy. Can you please point me in the right direction in locating information about these? As always, Google is your friend, but see for example http://www.nwpho.org.uk/sadb/Poisson%20CI%20in%20spreadsheets.pdf . . . Also see http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/ExcelProblems -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Producing Editable Graphs for PowerPoint
Paul Hewson wrote: Hello, (apologies, I'm not entirely sure whether this question is about R or my limitations with PowerPoint). I've submitted a paper (which has been accepted) but the journal now requires me to submit graphs that are editable in PowerPoint. I would be grateful for suggestions as to how I should do this. The best route seems to be to copy-and-paste the figures from the windows() device as a metafile. However, on converting the graphs to an editable format and ungrouping them, it appears that every line, tickmark, dot and dash is treated as a separate entity by PowerPoint (for example, even the horizontal and vertical parts of '+' are separated). Alternatively, reading in the files from a saved metafile is even more problematic: having used par(new = TRUE) I have a set of layered graphs, and all but the last one disappear once I open the graphic for editing. Is this normal for a figure editable by PowerPoint or am I doing something horribly wrong somewhere? Many thanks Paul -=-=-=-=-=-=-=-=-=-=-=-= Paul Hewson Lecturer in Statistics School of Mathematics and Statistics University of Plymouth Drake Circus Plymouth PL4 8AA tel (01752) 232778 fax (01752) 232780 email [EMAIL PROTECTED] www.plymouth.ac.uk I had hoped that journals would be engaged in reproducible research by now. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] How to display each symbol in a different color using plot with summary.formula.reverse
Greg Blevins wrote: Dear R Masters, I have searched high and low (the help archives and my various R reference material and help files) for a solution to what appears to me to be quite a simple problem. In the following syntax, variable n10 has three levels. I would like the symbols that appear in the graph for these three levels to be different colors. The best I have been able to do is to have the Key display three colors, but the symbols in the graph only show up in black and white. Any suggestions?
par(cex=.8)
s <- summary(n10 ~ n13, method='reverse', test=TRUE)
plot(s, dotsize=1.2, cex.labels=.7, cex.axis=.5, cex.main=.5, which='categorical')
Key(locator(1))
R version 2.0.1, Windows XP. Thank you, Greg Blevins The Market Solutions Group Greg, These kinds of questions are probably best directed at package maintainers. This would be an enhancement to the dotchart2 function used by plot.summary.formula.reverse, plus the addition of a new argument passed through to it from plot. I have found symbols (esp. circle vs. triangle) to be more effective for this purpose, so I am not motivated to work on this very soon, but I would consider it. -Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] GAM: Overfitting
Jean G. Orelien wrote: I am analyzing particulate matter data (PM10) on a small data set (147 observations). I fitted a semi-parametric model and am worried about overfitting. How can one check for model fit in GAM? Jean G. Orelien It's good to separate 'model fit' (or lack of fit) from 'overfitting'. Overfitting can cause the model fit to appear to be excellent, but there is still a huge problem. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
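A quick way to see Frank's point (a sketch with invented data; the choice of 20 df is arbitrary): an overly flexible fit to pure noise still produces a seemingly good apparent fit, which a naive fit check would not flag.
set.seed(1)
x <- runif(147)             # same n as in the post above
y <- rnorm(147)             # response is pure noise, unrelated to x
fit <- lm(y ~ poly(x, 20))  # overly flexible 20-parameter fit
summary(fit)$r.squared      # a 'respectable' apparent R-squared from noise alone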
Re: [R] Developing R classes
Gabor Grothendieck wrote: Chambers' book, Programming with Data, has a lot about S4. Note that S4 has more features but S3 is simpler and has higher performance, so it's not all one way. Also, they are related, so if you learn the simpler S3 first it will make it easier to learn S4 later. I suggest you download and read the source code for package `zoo' for S3 and package `its' for S4. Both packages define irregular time series classes. I agree with Gabor. S3 is much easier and faster to implement and has flexibility advantages (e.g., adding new elements to fit objects for debugging). Frank Date: Tue, 28 Dec 2004 13:46:10 -0300 (ART) From: Gilvan Justino [EMAIL PROTECTED] To: r-help@stat.math.ethz.ch Subject: [R] Developing R classes Hi, I'm trying to write some R classes but I didn't find enough documentation to develop them. I read there are 2 ways to write classes, using the S3 or S4 models, and it seems that S4 is the better model, so I think I should use that one. I'm a new user of R and I've searched the net for information about creating new classes. I found this document: http://www.biostat.harvard.edu/courses/individual/bio271/lectures/L11/S4Objects.pdf If someone knows some docs about creating our own classes, could you please post their URLs here? One more question... It seems that some documents about S can be applied to R. How do you know if you can use this documentation with R? I'm asking this because some of the stuff I was looking for I found explained using S and not R. Am I right or am I looking in the wrong place? Thanks a lot! __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
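As a minimal illustration of the S3 route Gabor and Frank recommend starting with (the class name 'person' and its fields are made up for the example):
person <- function(name, age)
  structure(list(name = name, age = age), class = 'person')
print.person <- function(x, ...) {
  cat(x$name, 'is', x$age, 'years old\n')
  invisible(x)       # print methods conventionally return the object invisibly
}
p <- person('Gilvan', 30)
p                    # automatically dispatches to print.person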
[R] Black and white graphics and transparent strip panels with lattice under Sweave
What is the most elegant way to specify that strip panels are to have transparent backgrounds and graphs are to be in black and white when lattice is being used with Sweave? I would prefer a global option that stays in effect for multiple plots. If this is best done with a theme, does anyone have a lattice theme like col.whitebg but that is for black and white? I'm using the following with lattice 0.10-16, grid 2.0.0: platform i586-mandrake-linux-gnu; arch i586; os linux-gnu; system i586, linux-gnu; major 2; minor 0.0; year 2004; month 10; day 04; language R (i.e., R 2.0.0). Thanks, Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Black and white graphics and transparent strip panels with lattice under Sweave
Deepayan Sarkar wrote: On Sunday 02 January 2005 19:40, Frank E Harrell Jr wrote: What is the most elegant way to specify that strip panels are to have transparent backgrounds and graphs are to be in black and white when lattice is being used with Sweave? I would prefer a global option that stays in effect for multiple plots. If this is best done with a theme, does anyone have a lattice theme like col.whitebg but that is for black and white? I'd do something like this as part of the initialization: ...
library(lattice)
ltheme <- canonical.theme(color = FALSE)       ## in-built B&W theme
ltheme$strip.background$col <- 'transparent'   ## change strip bg
lattice.options(default.theme = ltheme)        ## set as default
@ Deepayan That worked perfectly. Thank you very much Deepayan. -Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] different DF in package nlme and lme4
Douglas Bates wrote: Christoph Buser wrote: Hi all I tried to reproduce an example with lme and used the Orthodont dataset.
library(nlme)
fm2a.1 <- lme(distance ~ age + Sex, data = Orthodont, random = ~ 1 | Subject)
anova(fm2a.1)
... Regards, Christoph Buser No. The calculation of denominator degrees of freedom in lme4 is bogus and I believe this is documented. Note that for all practical purposes there is very little difference between 25 and 100 denominator degrees of freedom. lme4 is under development (and has been for a seemingly interminable period of time). Getting the denominator degrees of freedom calculation right is way down the list of priorities. Many people express dismay about the calculation of denominator degrees of freedom in all versions of lme4. IIRC Frank Harrell characterizes this as one of the foremost deficiencies in R relative to SAS. I don't agree that this is a glaring deficiency. In fact I believe that there is no correct answer. The F statistics in a mixed model do not have an F distribution under the null hypothesis. It's all an approximation, which is why I don't stay up nights worrying about the exact details of the approximation. Doug - the main concern is accurate P-values; I don't really care which approximations are best, just that the ones used are at least as good as those in SAS. Without being an expert, I have come to believe that at the moment SAS is better than R in 2 areas: accurate P-values from mixed models and handling massive databases. On the former point I could easily be swayed by some type I error simulations. My plan for lme4 is that one slot in the summary object for an lme model will be an incidence table of terms in the fixed effects versus grouping factors for the random effects. This table will indicate whether a given term varies within groups defined by the grouping factor. Anyone who wants to implement their personal favorite calculation of denominator degrees of freedom based on this table will be welcome to do so. I will be interested also to see timings of lme4 (using S4) vs nlme (using S3) for the same model. Cheers, Frank I personally think that tests on the fixed-effects terms will be better implemented using the restricted likelihood-ratio tests defined by Reinsel and Ahn rather than the Wald tests, and the whole issue of denominator degrees of freedom may be moot. My apologies if I seem to be peeved. I am not upset by your question - it is an entirely reasonable question. It is just that I have discussed the issue of denominator degrees of freedom too many times. To me a more important objective of lme4 is to be able to handle random effects associated with crossed or partially crossed grouping factors. I believe that in those cases the calculation of denominator degrees of freedom will be very complicated and even more of an approximation than in the case of nested grouping factors. This is why I would rather finesse the whole issue by using the Reinsel and Ahn approach. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] JOBS: Vanderbilt University Department of Biostatistics
The well-supported and rapidly-growing Department of Biostatistics in the Vanderbilt University School of Medicine, Nashville, Tennessee, has multiple openings for persons with an MS or PhD in biostatistics, at all levels. We are especially recruiting for the following positions. - A senior faculty member to lead a multi-center clinical trial data coordinating center. The individual must have extensive experience in NIH-sponsored data coordinating centers. - A senior faculty member with experience in health services or outcomes research or clinical epidemiology to lead a biostatistics unit for surgical sciences. - Senior MS biostatisticians with 5 or more years of experience. At Vanderbilt, such persons have faculty positions. - Associate professors - Junior MS biostatisticians We are also seeking assistant and full professors. The department currently has 16 PhD faculty biostatisticians, 6 senior and 5 junior MS biostatisticians, and 10 computer systems analysts and plans to grow significantly in size over the next three years, continuing to add a significant number of faculty and staff biostatisticians after that. In general we seek individuals with expertise in clinical trials, health services and outcomes research, Bayesian methods, multicenter clinical trial data management, imaging, and modern statistical computing (especially R or S-Plus). See http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/JobOpenings for more information. All candidates must enjoy collaborative research and possess excellent written and verbal communication skills. Send CV to Search Committee, Biostatistics, S-2323 MCN, Vanderbilt University, Nashville, TN 37232-2158. Vanderbilt University is an equal opportunity/affirmative action employer; all qualified persons are encouraged to apply. CVs may be sent electronically to both [EMAIL PROTECTED] and [EMAIL PROTECTED] Please state in the subject of the E-mail the position for which you are applying or inquiring. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] labels attached to variable names
Denis Chabot wrote: Hi, Can we attach a more descriptive label (I may use the wrong terminology, which would explain why I found nothing in the FAQ) to variable names, and later have an easy way to switch to these labels in plots? I fear this is not possible and one must enter this by hand as ylab and xlab when making plots. Thanks in advance, Denis Chabot The Hmisc package supports this:
label(x) <- 'Some Label'
describe(x)   # uses variable name and label
plot(x, y, xlab=label(x))
Better:
xYplot(x, y)  # label used automatically
And if you do
units(x) <- 'whatever units of measurement'
then xYplot, describe, and other Hmisc functions will include the units (in a different font on graphs or when using latex()). Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-etiquette
Anne wrote: I'm about to present a report (for internal use of a governmental agency). I used R extensively, along with contributed packages, as well as communications on the R list. As well as citing R, I would like to know how to cite the contributed packages (it is not so easy, as some have been used extensively, others marginally, some are called from another package, and some were not used as software but gave me insight into how to proceed), as well as thank the persons who generously responded to my requests on the list... Is there an established way of doing that? Thanks I take the occasion to thank all of you! The R project is really a wonderful idea and realisation. Anne Good question Anne. I don't know of an established way, but here are examples based on the way I prefer people to cite my packages. For Hmisc, one of the two following ways works (BibTeX format): @MISC{Hmisc, author = {Harrell, Frank E.}, year = 2005, title = {{\tt Hmisc}: {A} library of miscellaneous \textsc{S} functions}, howpublished = {Available from {\tt biostat.\-mc.\-vanderbilt.\-edu/\-s/Hmisc}} } or @MISC{alzsplus, author = {Alzola, Carlos F. and Harrell, Frank E.}, year = 2004, title = {An {Introduction} to {S} and the {Hmisc} and {Design} {Libraries}. {Available} from {\tt http://biostat.mc.vanderbilt.edu/\-twiki/pub/Main/\-RS/sintro.pdf}.}, note = {Electronic book, 308 pages} } For Design, one of the following two: @book{rms, author = {Harrell, Frank E.}, year = 2001, title = {Regression Modeling Strategies, with Applications to Linear Models, Survival Analysis and Logistic Regression}, publisher = {Springer}, address = {New York} } or @MISC{Design, author = {Harrell, Frank E.}, year = 2005, title = {{\tt Design}: {S} functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. Available from {\tt biostat.\-mc.\-vanderbilt.\-edu/s/\-{Design}}} } I am never clear on the proper year to use for referencing the packages by themselves, as the packages are constantly being improved. Frank Anne Piotet Tel: +41 79 359 83 32 (mobile) Email: [EMAIL PROTECTED] --- M-TD Modelling and Technology Development PSE-C CH-1015 Lausanne Switzerland Tel: +41 21 693 83 98 Fax: +41 21 646 41 33 -- [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-etiquette
Prof Brian Ripley wrote: On Sun, 9 Jan 2005, Anne wrote: I'm about to present a report (for internal use of governmental agency). I used extensively R , contibuted packages, as well as communications on the R-list As well as citing R, I would like to know how to cite the contributed packages (it is not so easy, as some have been used exensively, other marginally, some are called from another package and some were not used as softwares but gave me insight of how to proceed), as well as thank the persons who generously responded to my demands on the list... Is there an established way of doing that? See the citation() function, which works for packages too and a few authors have taken advantage of the means of customizing it. I keep learning - thanks Brian. I will add CITATION files in the parent directory of my packages. The defaults generated from citation('packagename') for them are not bad. -Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
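For reference, the calls are simply the following (output varies by package and version, and reflects a package's CITATION file if it has one):
citation()         # how to cite R itself
citation('Hmisc')  # citation info for a contributed package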
Re: [R] how to do by-processing using weighted.mean
Denis Chabot wrote: Hi, I'd like to compute the weighted mean of some variables, all using the same weight variable, for each combination of 3 factor variables. I found how I could use summarize (from Hmisc) to do normal means for combinations of 3 factors, but I cannot find a way of doing weighted means. Is it possible in R? Thanks in advance, Denis Chabot __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html It's in one of the summarize examples! -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
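A sketch of the pattern used in the summarize examples, with an invented data frame: pass the weight variable alongside the analysis variable and apply wtd.mean inside FUN.
library(Hmisc)
set.seed(1)
d <- data.frame(y = rnorm(60), w = runif(60),
                g1 = gl(2, 30), g2 = gl(3, 10, 60))
with(d, summarize(cbind(y, w), llist(g1, g2),
                  function(z) wtd.mean(z[, 1], weights = z[, 2]),
                  stat.name = 'weighted.mean'))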
Re: [R] R-etiquette
Liaw, Andy wrote: I'd thank them in the acknowledgement section. I think some (most?) journal will allow such a section, and most people use that to thank their research funding sources and/or collaborators who had not made the list of authors. Andy You do need to get permission from people to include their names in an Ack. section, usually. Referencing as a personal communication is perhaps better. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Standard error for the area under a smoothed ROC curve?
Dan Bolser wrote: Hello, I am making some use of ROC curve analysis. I find much help on the mailing list, and I have used the Area Under the Curve (AUC) functions from the ROC package in the Bioconductor project... http://www.bioconductor.org/repository/release1.5/package/Source/ROC_1.0.13.tar.gz However, I read here... http://www.medcalc.be/manual/mpage06-13b.php The 95% confidence interval for the area can be used to test the hypothesis that the theoretical area is 0.5. If the confidence interval does not include the 0.5 value, then there is evidence that the laboratory test does have an ability to distinguish between the two groups (Hanley & McNeil, 1982; Zweig & Campbell, 1993). But aside from early on, the above article is short on details. Can anyone tell me how to calculate the CI of the AUC calculation? I read this... http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf which talks about resampling (by showing R code), but I can't understand what is going on, or what is calculated (the example given is specific to microarray analysis, I think). I think a general AUC CI function would be a good addition to the ROC package. One more thing: in calculating the AUC I see the splines function is recommended over the approx function. Here... http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html How would I rewrite the following AUC functions (adapted from the Bioconductor source) to use splines (or approxfun or splinefun)?
spe # Specificity
 [1] 0.02173913 0.13043478 0.21739130 0.32608696 0.43478261 0.54347826
 [7] 0.65217391 0.76086957 0.89130435 1.00000000 1.00000000 1.00000000
[13] 1.00000000
sen # Sensitivity
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9302326 0.8139535
 [8] 0.6976744 0.5581395 0.4418605 0.3488372 0.2325581 0.1162791
trapezint(1-spe, sen)
my.integrate(1-spe, sen)
## Functions ## Nicked (and modified) from the ROC package in Bioconductor.
trapezint <- function (x, y, a = 0, b = 1) {
  if (x[1] > x[length(x)]) {
    x <- rev(x)
    y <- rev(y)
  }
  y <- y[x >= a & x <= b]
  x <- x[x >= a & x <= b]
  if (length(unique(x)) < 2) return(NA)
  ya <- approx(x, y, a, ties = max, rule = 2)$y
  yb <- approx(x, y, b, ties = max, rule = 2)$y
  x <- c(a, x, b)
  y <- c(ya, y, yb)
  h <- diff(x)
  lx <- length(x)
  0.5 * sum(h * (y[-1] + y[-lx]))
}
my.integrate <- function (x, y, t0 = 1) {
  f <- function(j) approx(x, y, j, rule = 2, ties = max)$y
  integrate(f, 0, t0)$value
}
Thanks for any pointers, Dan. I don't see why the above formulas are being used. The Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works great. Just get the U statistic (concordance probability) used in the Wilcoxon test. As Somers' Dxy rank correlation coefficient is 2*(C - 0.5) where C is the concordance probability or ROC area, the Hmisc package function rcorr.cens uses U statistic methods to get the standard error of Dxy. You can easily translate this to a standard error of C. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
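A sketch of the translation Frank describes, with assumed inputs (pred a numeric score, y a 0/1 outcome): rcorr.cens reports Dxy and its standard deviation, and since C = (Dxy + 1)/2, SE(C) = SE(Dxy)/2.
library(Hmisc)
r    <- rcorr.cens(pred, y)
C    <- r['C Index']
se.C <- r['S.D.'] / 2          # the reported S.D. refers to Dxy
C + c(-1.96, 1.96) * se.C      # approximate 95% CI for the ROC area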
Re: [R] transcan() from Hmisc package for imputing data
avneet singh wrote: Hello: I have been trying to impute missing values of a data frame which has both numerical and categorical values, using the function transcan(), with little luck. Would you be able to give me a simple example where a data frame is fed to transcan and it spits out a new data frame with the NA values filled in? Or is there any other function that I could use? Thank you avneet It's in the help file for transcan. But multiple imputation is much better, and transcan does not do multiple imputation as well as the newer Hmisc function aregImpute. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
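A sketch of the aregImpute workflow Frank points to (data frame and variable names are invented; lrm is in the Design package):
library(Hmisc)
library(Design)
a <- aregImpute(~ y + age + sex + chol, data = mydata, n.impute = 5)
f <- fit.mult.impute(y ~ age + sex + chol, lrm, a, data = mydata)
# f combines the 5 completed-data fits, with variances adjusted for imputation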
Re: [R] Kolmogorov-Smirnof test for lognormal distribution with estimated parameters
Christoph Buser wrote: Hi Kwabena I once did a simulation, generating normally distributed values (500 values) and calculating a KS test with estimated parameters. Repeating this test many times, I got far fewer significant tests than the roughly 5% expected by chance at level alpha=0.05. So I think that if you estimate the parameters from the data, you fit too well, and the distribution used for the test statistic is not adequate, as is indicated in the help page you cited. There (in the help page) is some literature, but it is not easy stuff to read. Furthermore, I know of no implementation of a KS test which accounts for this estimation of the parameters. I recommend a graphical tool instead of a test:
x <- rlnorm(100)
qqnorm(log(x))
See also ?qqnorm and ?qqplot. If you insist on testing a theoretical distribution, be aware that a non-significant test does not mean that your data have the tested distribution (especially if you have few data: there is no power in the test to detect deviations from the theoretical distribution, and the conclusion that the data fit well is treacherous). If there are enough data I'd prefer a chi-square test to the KS test (but even there I use graphical tools instead). See ?chisq.test. For this test you have to specify classes, and this is subjective (you can't avoid it). You can reduce the DF of the expected chi-square distribution (under H_0) by the number of parameters estimated from the data and will get better results: DF = number of classes - 1 - number of estimated parameters. I think this test is more powerful than the KS test, particularly if you must estimate the parameters from the data. Regards, Christoph It is also a good idea to ask why one compares against a known distributional form. If you use the empirical CDF to select a parametric distribution, the final estimate of the distribution will inherit the variance of the ECDF. The main reason statisticians think that parametric curve fits are far more efficient than nonparametric ones is that they don't account for model uncertainty in their final confidence intervals. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
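A compact version of the simulation Christoph describes (a sketch; the conservatism arises because the test is applied against parameters estimated from the same data):
set.seed(1)
p <- replicate(10000, {
  x <- rnorm(500)
  ks.test(x, 'pnorm', mean(x), sd(x))$p.value
})
mean(p < 0.05)   # far below the nominal 0.05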
Re: [R] Standard error for the area under a smoothed ROC curve?
Dan Bolser wrote: On Wed, 12 Jan 2005, Frank E Harrell Jr wrote: Dan Bolser wrote: [... original message quoted in full above; snipped ...] I don't see why the above formulas are being used. The Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works great. Just get the U statistic (concordance probability) used in the Wilcoxon test. As Somers' Dxy rank correlation coefficient is 2*(C - 0.5) where C is the concordance probability or ROC area, the Hmisc package function rcorr.cens uses U statistic methods to get the standard error of Dxy. You can easily translate this to a standard error of C. I am sure I could do this easily, except I can't. The good thing about ROC is that I understand it (I can see it). I know why the area means what it means, and I could even imagine how sampling the data could give a CI on the area. However, I don't know why the area under the ROC curve is well known to be equivalent to the numerator of the Mann-Whitney U statistic - from http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf Nor do I know how to calculate the numerator of the Mann-Whitney U statistic. This is clear in the original Bamber or Hanley-McNeil articles.
The ROC area is a linear translation of the mean rank of predicted values in one of the two outcome groups. The little somers2 function in Hmisc shows this:
##S function somers2
##
##Calculates concordance probability and Somers' Dxy rank correlation
##between a variable X (for which ties are counted) and a binary
##variable Y (having values 0 and 1, for which ties are not counted).
##Uses short cut method based on average ranks in two groups.
##
##Usage:
##
## somers2(X,Y)
##
##Returns vector whose elements are C Index, Dxy, n and missing, where
##C Index is the concordance probability and Dxy=2(C Index-.5).
##
##F. Harrell 28 Nov 90 6 Apr 98: added weights
somers2 <- function(x, y, weights=NULL, normwt=FALSE, na.rm=TRUE)
{
  if(length(y) != length(x)) stop('y must have same length as x')
  y <- as.integer(y)
  wtpres <- length(weights)
  if(wtpres && (wtpres != length(x)))
    stop('weights must have same length as x')
  if(na.rm) {
    miss <- if(wtpres) is.na(x + y + weights) else is.na(x + y)
    nmiss <- sum(miss)
    if(nmiss > 0) {
      miss <- !miss
      x <- x[miss]
      y <- y[miss]
      if(wtpres) weights <- weights[miss]
    }
  }
  else nmiss <- 0
  u <- sort(unique(y))
  if(any(y %nin% 0:1)) stop('y must be binary')  ## 7dec02
  if(wtpres) {
    if(normwt) weights <- length(x
Re: [R] empirical (sandwich) SE estimate in lme ()?
A.J. Rossini wrote: Right, but you still can sandwich them if you want. (I recently did that in Proc MIXED, but Michael, I'm not sure how to do it using lme). best, -tony The sandwich estimator can help if the model is misspecified but at a cost of worse precision in estimating variances. Frank On Fri, 14 Jan 2005 11:16:10 -0800, Berton Gunter [EMAIL PROTECTED] wrote: ??? correlated within group errors are explicitly modeled by corStruct classes. See ?lme and Chapter 5.3 in Bates and Pinheiro. -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA The business of the statistician is to catalyze the scientific learning process. - George E. P. Box -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Na Li Sent: Friday, January 14, 2005 8:42 AM To: r-help@stat.math.ethz.ch Subject: [R] empirical (sandwich) SE estimate in lme ()? Is it possible to get the empirical (sandwich) S.E. estimates for the fixed effects in lme () (thus allowing possibly correlated errors within the group)? In SAS you can get it by the 'empirical' option to PROC MIXED. Cheers, Michael -- Na (Michael) Li, Ph.D. Division of Biostatistics A443 Mayo Building, MMC 303 School of Public Health420 Delaware St SE University of MinnesotaMinneapolis, MN 55455 Phone: (612) 626-4765 Email: [EMAIL PROTECTED] Fax: (612) 626-0660 http://www.biostat.umn.edu/~nali __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Omitting constant in ols() from Design
Tobias Muhlhofer wrote: Hi! I need to run OLS regressions with Huber-White sandwich estimators and the corresponding standard errors, without an intercept. What I'm trying to do is create an ols object and then use the robcov() function, on the order of:
f <- ols(depvar ~ ind1 + ind2, x=TRUE)
robcov(f)
However, when I go
f <- ols(depvar ~ ind1 + ind2 - 1, x=TRUE)
I get the following error: Error in ols(nareit ~ SnP500 + d3yrtr - 1) : length of dimnames [2] not equal to array extent Same with +0 instead of -1. Is there a different way to create an ols object without a constant? I can't use lm(), because robcov() needs an object from the Design series. Or is there a different way to go about this? Tobias Muhlhofer ols does not support this. Sorry. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] bwplot: how not to draw outliers
Sundar Dorai-Raj wrote: RenE J.V. Bertin wrote: Hello, and (somewhat belated) best wishes for 2005. Can one tell bwplot not to draw outliers, or at least to exclude them from the vertical axis scaling? If so, how (or what doc do I need to consult)? The options that have this effect in boxplot() do not appear to have any effect with bwplot (although outline=FALSE in boxplot does *not* change the scaling). Thanks, RenE Bertin RenE, There may be other solutions, but you can do this using the prepanel option to set the ylim:
library(lattice)
set.seed(1)
z <- data.frame(x = rt(100, 1), g = rep(letters[1:4], each = 25))
bwplot(x ~ g, z, prepanel = function(x, y) {
  bp <- boxplot(split(y, x), plot = FALSE)
  ylim <- range(bp$stats)
  list(ylim = ylim)
})
If you actually want to exclude the points (rather than just prevent outliers from affecting the scale), you will have to modify the panel.bwplot function in addition to using the above. --sundar You may also want to try
library(Hmisc)
library(lattice)
bwplot(..., panel=panel.bpplot)
?panel.bpplot
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] (no subject)
Virginie Rondeau wrote: Hello I would like to compare the results obtained with a classical semiparametric proportional hazards model with a parametric proportional hazards model using a Weibull distribution. How can we obtain the equivalence of the parameters using coxph (semiparametric model) and survreg (parametric model)? Thanks Virginie In the Design package look at the pphsm function, which converts a survreg Weibull fit (fitted by the psm function, an adaptation of survreg) to PH form. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
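A sketch of the route Frank suggests (variable names assumed; psm and pphsm are in Design):
library(Design)
f.wei <- psm(Surv(time, status) ~ x1 + x2, dist = 'weibull')  # parametric AFT fit
pphsm(f.wei)  # re-expresses the Weibull fit on the proportional hazards
              # scale, so coefficients are comparable to a coxph/cph fit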
Re: [R] Cross-validation accuracy in SVM
Ton van Daelen wrote: Hi all - I am trying to tune an SVM model by optimizing the cross-validation accuracy. Maximizing this value doesn't necessarily seem to minimize the number of misclassifications. Can anyone tell me how the cross-validation accuracy is defined? In the output below, for example, cross-validation accuracy is 92.2%, while the fraction of correctly classified samples is (1476+170)/(1476+170+4) = 99.8%!? Thanks for any help. Regards - Ton Percent correctly classified is an improper scoring rule. The percent is maximized when the predicted values are bogus. In addition, one can add a very important predictor and have the % actually decrease. Frank Harrell --- Parameters: SVM-Type: C-classification SVM-Kernel: radial cost: 8 gamma: 0.007 Number of Support Vectors: 1015 ( 148 867 ) Number of Classes: 2 Levels: false true 5-fold cross-validation on training data: Total Accuracy: 92.24242 Single Accuracies: 90 93.3 94.84848 92.72727 90.30303 Contingency Table:
            predclasses
origclasses false true
      false  1476    0
      true      4  170
__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
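To illustrate Frank's point with the class frequencies from the table above: a constant prediction that ignores the predictors entirely already scores 89.5% accuracy (1476 of the 1650 samples are 'false'), so high accuracy by itself says little about the quality of the predicted probabilities.
y <- rep(c('false', 'true'), c(1476, 174))  # outcome frequencies from the table
mean(y == 'false')                          # 0.895 'accuracy' for always predicting 'false'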
Re: [R] logistic regression 3D-plot
[EMAIL PROTECTED] wrote: Dear R-helpers, I tried to create a 3D surface showing the interaction between two continuous explanatory variables; the response variable is binary (0/1). The model is:
model <- glm(incidence ~ sun*trees, binomial)
Then I used wireframe to create a 3D plot:
xyz <- expand.grid(sun=seq(30,180,1), trees=seq(0,4000,10))
xyz$incidence <- as.vector(predict(model, xyz))
wireframe(incidence ~ sun*trees, xyz, scales=list(arrows=FALSE))
which gives me a 3D plot, but the scaling of the y-axis is wrong: the range is not from 0 to 1. So my question: is there a way to plot these kinds of models, with binary response variables? Thanks for your help, Heike
library(Design)
d <- datadist(mydata); options(datadist='d')
f <- lrm(incidence ~ sun*trees)  # lrm is for binary or ordinal responses
plot(f, sun=NA, trees=NA)  # add method='image' or 'contour' to get other types of graphs
plot(f, sun=NA, trees=NA, fun='plogis')  # probability scale
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
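A side note on the glm/wireframe code above: predict.glm defaults to the link (logit) scale, which is why the vertical axis does not run from 0 to 1; asking for response-scale predictions fixes that directly:
xyz$incidence <- as.vector(predict(model, xyz, type = 'response'))
wireframe(incidence ~ sun * trees, xyz, scales = list(arrows = FALSE))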
Re: [R] logistic regression 3D-plot CORRECTION
library(Design)
d <- datadist(mydata); options(datadist='d')
f <- lrm(incidence ~ sun*trees)  # lrm is for binary or ordinal responses
plot(f, sun=NA, trees=NA)  # add method='image' or 'contour' to get other types of graphs
plot(f, sun=NA, trees=NA, fun='plogis')  # probability scale
Correction: fun=plogis, not 'plogis'. Sorry about that -FH __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Reading Dates in a csv File
Charles and Kimberly Maner wrote: Hi all. I'm reading in a flat, comma-delimited file using read.csv. It works marvelously for the most part. I am using the colClasses argument to, basically, create numeric, factor and character classes for the columns I'm reading in. However, a couple of the fields in the file are date fields. I'm fully aware that POSIXct can be used as a class, however the field must obey (I think) the standard/default POSIXct format. Hence the following question: does anyone have a method they can share to read in a date in a non-standard format and convert it to POSIXct? I can read it in and then convert it, but that's a two-pass approach and not as elegant as a single pass through read.csv. I've read, from the documentation, that [o]therwise there needs to be an as method (from package methods) for conversion from character to the specified formal class, but I do not know and have not figured out how to do that. Any suggestion(s) would be greatly appreciated. Thanks, Charles The csv.get function in the Hmisc package may do most of what you want. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
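A sketch of the csv.get route (file and column names are invented; see ?csv.get for the date-handling arguments):
library(Hmisc)
d <- csv.get('mydata.csv',
             datevars = c('visitdate', 'dob'),  # columns holding dates
             dateformat = '%m/%d/%Y')           # their non-default format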
Re: [R] no. at risk in survfit()
array chip wrote: Hi, when I generate a survfit() object, I can get the number of patients at risk at various time points by using summary():
fit <- survfit(Surv(time,status) ~ class, data=mtdata)
summary(fit)
class=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  9.9     78       1    0.987  0.0127        0.963            1
 41.5     77       1    0.974  0.0179        0.940            1
 54.0     76       1    0.962  0.0218        0.920            1
 99.1     38       1    0.936  0.0328        0.874            1
class=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  6.9    102       1    0.990  0.00976       0.971        1.000
  8.0    101       1    0.980  0.01373       0.954        1.000
 14.4    100       1    0.971  0.01673       0.938        1.000
 16.1     99       1    0.961  0.01922       0.924        0.999
 16.6     98       1    0.951  0.02138       0.910        0.994
 18.7     97       1    0.941  0.02330       0.897        0.988
 :      :      :
I have many censored observations in the dataset, and I would like to know the number of patients at risk (n.risk in the above output) at certain time points, for example at 60, 72, etc., which are not available from the above printout for class=1. Is there any way I can get them? Thanks The Design package's survplot function can print n.risk over equally spaced time points. You might see an easy way to print this by looking at the code. -Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
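Note also that summary.survfit accepts a times argument, which gives the number at risk at arbitrary time points directly:
summary(fit, times = c(60, 72))  # n.risk, n.event and survival at t = 60 and 72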
Re: [R] Programming/scripting with expressions - variables
Gorjanc Gregor wrote: Hello to R users! I am puzzled with R and I really do not know where to look for my problem. I am moving from SAS and I have difficulties in translating SAS to the R world. I hope I will get some hints or pointers so I can study from there on. I would like to do something like this. In SAS I can write a macro as in the example below, which is of course a silly one but shows what I don't know how to do in R.
%macro test(data, colname, colvalue);
  data &data;
    &colname = &colvalue;
    other_&colname = other_&colvalue;
  run;
%mend;
And if I run it with this call: %test(Gregor, Gorjanc, 25); I get a table with name 'Gregor' and columns 'Gorjanc' and 'other_Gorjanc' with values: Gorjanc other_Gorjanc 25 other_25 So can someone show me the way to do the same thing in R? Thanks! -- Lep pozdrav / With regards, Gregor GORJANC Greetings, Gregor, from a fan of Slovenia. Here is a start. Multiple variables are often put together in data frames. Your example did not include an input dataset name. I took data to be an input data frame, and added two new variables. I assume that colvalue and colname are character values. If colvalue has length one it will be duplicated for each row of data.
test <- function(data, colname, colvalue) {
  data[[colname]] <- colvalue
  data[[paste('other', colname, sep='_')]] <- paste('other', colvalue, sep='_')
  data
}
Example usage:
data <- data.frame(x=1:5, y=11:15)
data2 <- test(data, 'a', 'b')
data <- test(data, 'a', 'b')       # over-write
test(data, 'a', 'b')               # create data but don't store
with(test(data, 'a', 'b'), {...})  # reference variables in a temporary dataset
If we know what your ultimate goal is, there may be a much better approach than the test function. In many cases, you do not need to create new variables at all. Franc -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re:logistic regression
Vito Ricci wrote: Hi, I don't know if a pseudo R-squared for glm exists in any R package, but I found some interesting functions on the S mailing list: It is included in lrm in the Design package. But note that this is not for checking fit but rather for quantifying predictive discrimination. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re:logistic regression
Joe Nocera wrote: Helene - In addition to some of the excellent suggestions already posited (e.g. examining AIC, pseudo R^2 in the Design package), you might want to consider another tool to assess logistic regression model accuracy: the area under the curve (AUC) of a receiver operating characteristic (ROC). The ROC curve describes the relationship between the rate of true positives observed (sensitivity) and the rate of false positives (1 - specificity). The AUC is the probability that the model can correctly distinguish between the two outcome classes. This is an appealing alternative to some of the known issues of citing only a pseudo-R^2 (like Nagelkerke's, for instance) to describe 'fit'. That's also standard output in Design's lrm function (the C index). But unlike R^2, comparing two models on the basis of ROC area is not very sensitive. -Frank Check out the ROC functions available at the Bioconductor website. There was also some code sent around on the list a few months back for calculating trapezoidal AUC, SEs from ROC, and comparing two ROC curves... search the archives if interested, or I can probably dig them out for you offline... Cheers, Joe Quoting Frank E Harrell Jr [EMAIL PROTECTED]: Vito Ricci wrote: Hi, I don't know if a pseudo R-squared for glm exists in any R package, but I found some interesting functions on the S mailing list: It is included in lrm in the Design package. But note that this is not for checking fit but rather for quantifying predictive discrimination. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Joseph J. Nocera Ph.D. Candidate NB Coop. Fish & Wildlife Research Unit Biology Department - Univ. New Brunswick Fredericton, NB Canada E3B 6E1 tel: (902) 679-5733 Why does it have to be spiders? Why can't it be 'follow the butterflies'?! Ron Weasley, Harry Potter & The Chamber of Secrets __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Forward Stepwise regression based on partial F test
Smit, Robin wrote: I am hoping to get some advice on the following: I am looking for an automatic variable selection procedure to reduce the number of potential predictor variables (~50) in a multiple regression model. I would be interested in using forward stepwise regression based on the partial F test. I have looked into possible R functions but could not find this particular approach. There is a function (stepAIC) that uses the Akaike criterion or Mallow's Cp criterion. In addition, the drop1 and add1 functions came closest to what I want, but with them I cannot perform the required procedure. Do you have any ideas? Kind regards, Robin Smit Business Unit TNO Automotive Environmental Studies & Testing PO Box 6033, 2600 JA Delft THE NETHERLANDS Robin, If you are looking for a method that does not offer the best predictive accuracy and that violates every aspect of statistical inference, you are on the right track. See http://www.stata.com/support/faqs/stat/stepwise.html for details. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Teaching R in 40 minutes. What should be included?
Spencer Graves wrote: I agree with Thomas and Georg: A 40-minute intro should be mostly marketing and very little how-to. I think you'll have a more effective sales job if you target, say, 4 examples averaging 5 slides each plus some general overview, max 25-30 slides. If I had sufficient prep time and a few collaborators among physicists, geographers, etc., I might get their help in preparing examples, showing how they would do something in Matlab or Scilab or something else vs. R. And I'd end with a discussion of technical support via an R site search and r-help, and showing a list of available contributed packages. I'd do a couple of searches for physics and geography questions: odesolve, maps, etc. Maybe pick examples that are part of the help files. Then show: here is how I find X, here is the vignette, help or whatever. Is it fair to say that R is rapidly becoming (if it is not already) the primary platform of choice for new statistical algorithm development? I think they might be interested in a brief overview of the contributed software. If this is an academic audience, they might like to know how easy it is to contribute software, plus the journals on statistical computing and graphics, etc. Hope this helps. Good luck! spencer graves I often give talks like that. The thing that has impressed audiences the most is a multi-panel lattice graphic with 2 classification variables and in each panel a scatterplot and a lowess trend line. A single page with 24 small high-resolution histograms also seems to impress people. The nomogram function in the Design package seems to also connect with non-statisticians, as does latex(describe(mydataframe)) using Hmisc. People like seeing, in the LaTeX previewer, output that mixes tabular summaries and graphics. Frank Harrell Thomas Schönhoff wrote: Hello, On Friday, 25 February 2005 at 22:37, Dr Carbon wrote: If _you_ were asked to give a 40-minute dog and pony show about R for a group of scientists ranging from physicists to geographers, what would you put in? These people want to know what R can do. I'm thinking about something like: A. Overview B. data structures C. arithmetic and manipulation D. reading data E. linear models using glm F. graphics G. programming H. other tricks like rpart or time series analysis? If your audience is well known, I would be inclined to target some (simple) examples derived from physics and geography to demonstrate basic ideas of working with R, similar to the ones listed above. Well, 40 minutes is not very long, so I recommend simplifying your presentation as much as you can. You want to teach them R in 40 minutes but will rather tend to confuse them if you don't shorten your plan a bit. I.e., teaching programming in R in a few minutes to scientists who are not at all accustomed to programming is too much overhead, I think. Well, it's up to your estimation of what is expected to follow your presentation. If you are sure that most of them know enough programming to understand the basic concepts of R programming, everything will be fine! If not, I'd recommend concentrating on basic operations (data structures, arithmetic and manipulation, importing/exporting data) and some often-used default statistical procedures demonstrating common tasks (is time series analysis important in physics or geography? I don't know), including some remarks on differences from widespread statistical packages like SPSS or SAS, maybe LispStat.
Finally, there should be some extended view of available resources (manuals, FAQ, community) as a starter to learn, use and program R by themselves. I think this would do for a 40-minute presentation without running the risk of deterring people with overcomplexity. regards Thomas -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
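For readers wanting to reproduce the display Frank mentions above, a sketch with invented data and variable names:
library(lattice)
set.seed(1)
d <- data.frame(x = rnorm(400), g1 = gl(4, 100), g2 = gl(2, 200))
d$y <- d$x + rnorm(400)
xyplot(y ~ x | g1 * g2, data = d,
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)  # scatterplot in each panel
         panel.loess(x, y, ...)   # smooth trend line in each panel
       })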
Re: [R] subset selection for logistic regression
Wittner, Ben wrote: R packages leaps and subselect implement various methods of selecting best or good subsets of predictor variables for linear regression models, but they do not seem to be applicable to logistic regression models. Does anyone know of software for finding good subsets of predictor variables for logistic regression models? Thanks. -Ben Why are these procedures still being used? The performance is known to be bad in almost every sense (see the r-help archives). Frank Harrell p.s. The leaps package references Subset Selection in Regression by Alan Miller. On page 2 of the 2nd edition of that text it states the following: All of the models which will be considered in this monograph will be linear; that is, they will be linear in the regression coefficients. Though most of the ideas and problems carry over to the fitting of nonlinear models and generalized linear models (particularly the fitting of logistic relationships), the complexity is greatly increased. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] subset selection for logistic regression
dr mike wrote: -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Wittner, Ben Sent: 02 March 2005 11:33 To: [EMAIL PROTECTED] Subject: [R] subset selection for logistic regression [... original message quoted in full above; snipped ...] The LASSO method and the Least Angle Regression method are two such that have both been implemented (efficiently IMHO - only one least-squares-sized computation for all levels of shrinkage, IIRC) in the lars package for R by Hastie and Efron. There is a paper by Madigan and Ridgeway that discusses the use of the Least Angle Regression approach in the context of logistic regression - available for download from Madigan's space at Rutgers: www.stat.rutgers.edu/~madigan/PAPERS/lars3.pdf HTH Mike Yes, things like the lasso can help a lot. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
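A sketch of basic lars usage (x is an assumed numeric predictor matrix, y the response; lars itself fits the linear-model path, and the Madigan and Ridgeway paper covers the logistic extension):
library(lars)
fit <- lars(x, y, type = 'lasso')  # whole lasso path in one pass
plot(fit)                          # coefficient paths versus shrinkage
cv.lars(x, y)                      # cross-validation to choose the shrinkage level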
Re: [R] subset selection for logistic regression
Christian Hennig wrote: Perhaps I should not write it because I will discredit myself with this but... Suppose I have a setup with 100 variables and some 1000 cases and I want to boil down the number of variables to a maximum of 10 for practical reasons even if I lose 10% prediction quality by this (for example because it is expensive to measure all variables on new cases). Is it really so wrong to use a stepwise method? Yes. Read about model uncertainty and bias in models developed using stepwise methods. One exception: if there is a large number of variables with truly zero regression coefficients, and the rest are not very weak, stepwise can sort things out fairly well. But you never know this in advance. Let's say I divide the sample into three parts and do variable selection on the first part, estimation on the second and testing on the third part (this solves almost all problems Frank is talking about on p. 56/57 in his excellent book). Is there always a tractable alternative? That's a good way to find out how bad the method is, not to fix the problems inherent in it. Of course it is wrong to interpret the selected variables as the true influences and all others as unrelated, but if I don't do that? If it should really be taboo to do stepwise variable selection, why are p. 58/59 of Regression Modeling Strategies devoted to how to do it if you must? Stress on if. And note that if you ask what is the optimum alpha for variables to be kept in the model when doing backwards stepdown, it's alpha=1.0. A good compromise is alpha=0.5. See @Article{ste01pro, author = {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and Harrell, Frank E. and Habbema, J. Dik F.}, title = {Prognostic modeling with logistic regression analysis: {In} search of a sensible strategy in small data sets}, journal = Medical Decision Making, year = 2001, volume = 21, pages = {45-56}, annote = {shrinkage; variable selection; dichotomization of continuous variables; sign of regression coefficient; calibration; validation} } And on Bert's excellent question about why shrinkage is not used more often, here is our attempt at a remedy: @Article{moo04pen, author = {Moons, K. G. M. and Donders, A. Rogier T. and Steyerberg, E. W. and Harrell, F. E.}, title = {Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example}, journal = J Clinical Epidemiology, year = 2004, volume = 57, pages = {1262-1270}, annote = {prediction research; overoptimism; overfitting; penalization; bootstrapping; shrinkage} } Frank Please forget my name;-) Christian On Wed, 2 Mar 2005, Berton Gunter wrote: To clarify Frank's remark ... A prominent theme in statistical research over at least the last 25 years (with roots that go back 50 or more, probably) has been the superiority of shrinkage methods over variable selection. I also find it distressing that these ideas have apparently not penetrated much (at all?) into the wider scientific community (but I suppose I shouldn't be surprised -- most scientists still do one-factor-at-a-time experiments 80 years after Fisher). Specific incarnations can be found in anything Bayesian, mixed effects models for repeated measures, ridge regression, and the R packages lars and lasso, among others.
I would speculate that aside from the usual statistics/science cultural issues, part of the reason for this is that the estimators don't generally come with neat, classical inference procedures: like it or not, many scientists have been conditioned by their Stat 101 courses to expect P values, so in some sense, we are hoisted by our own petard. Just my $.02 -- contrary (and more knowledgeable) opinions welcome. -- Bert Gunter
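If stepdown must be done, the alpha = 0.5 compromise discussed above can be requested directly; a minimal sketch with the Design package (fit and variable names hypothetical):
library(Design)
f <- lrm(y ~ x1 + x2 + x3, x = TRUE, y = TRUE)
fastbw(f, rule = "p", sls = 0.5)  # backward stepdown; variables stay while P < 0.5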
[R] Quantian
Congratulations Dirk on the article in linuxtoday.com today about Quantian. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: I was unsure of what suitable goodness-of-fit tests existed in R for logistic regression. After searching the R-help archive I found that the Design models and resid could be used to calculate this as follows: d <- datadist(mydataframe) options(datadist = 'd') fit <- lrm(response ~ predictor1 + predictor2..., data=mydataframe, x=T, y=T) resid(fit, 'gof'). I set up a script to first use glm to create models, then use stepAIC to determine the optimal model. I used this instead of fastbw because I found the AIC values to be completely different and the final models didn't always match. Then my script takes the reduced model formula and recreates it using lrm as above. Now the problem is that for some models I run into an error to which I can find no reference whatsoever on the mailing list or on the web. It is as follows: test.lrm <- lrm(cclo ~ elev + aspect + cti_var + planar + feat_div + loamy + sands + sandy + wet + slr_mean, data=datamatrix, x = T, y = T) singular information matrix in lrm.fit (rank= 10 ). Offending variable(s): slr_mean Error in j:(j + params[i] - 1) : NA/NaN argument Now if I set the singularity criterion smaller than the default of 1E-7, to 1E-9 or 1E-12 (the default in calibrate), it works. Why is that? Not being a statistician but a biogeographer using regression as a tool, I don't really understand what is happening here. Does changing the tol variable change how I should interpret goodness-of-fit results or other evaluations of the models created? I've included a summary of the data below (in case it might be helpful) with all variables in the data frame, as it was easier than selecting out the ones used in the model. Thanks in advance. T The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Logistic regression goodness of fit tests
Trevor Wiens wrote: On Thu, 10 Mar 2005 16:19:41 -0600 Frank E Harrell Jr [EMAIL PROTECTED] wrote: The goodness of fit test only works on prespecified models. It is not valid when stepwise variable selection is used (unless perhaps you use alpha=0.5). Perhaps I'm blind, but I can't find any reference to an alpha parameter on the help page for stepAIC. It runs fine however when I set the parameter, and produces a model with the same right-hand variables as without. Can you tell me what it is? T What I mean is the effective significance level for keeping a variable in the model. Using AIC for one-degree-of-freedom variables is effectively using an alpha of 0.16 if I recall properly. But I hope you got the point that resid(fit,'gof'), like most goodness of fit tests, assumes prespecified models. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Confidence interval for Tau-a or c-index to compare logistic lrm (binary) models with each other.
Jan Verbesselt wrote: Dear R list, How can a confidence interval be derived for e.g. the Tau-a coefficient or the c index (area under the ROC curve) such that I can compare the fitted lrm (logistic) models with each other? Is this possible? The aim is to conclude that one model is significantly better than another model (a, b or c). y ~ a (continuous) + d (categorical) y ~ b (continuous) + d (categorical) y ~ c (continuous) + d (categorical) I will look at residual deviance, the AIC, c-index and Tau-a to compare different logistic models (lrm, Design package). All extra advice is most welcome! Regards, Jan You can only do this if you have an independent test dataset that was never used in model development or coefficient estimation. Given that, look at the rcorrp.cens function in Hmisc. Frank Harrell ___ ir. Jan Verbesselt Research Associate Lab of Geomatics Engineering K.U. Leuven Vital Decosterstraat 102. B-3000 Leuven Belgium Tel: +32-16-329750 Fax: +32-16-329760 http://gloveg.kuleuven.ac.be/ __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
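A sketch of that comparison on an independent test set (object names hypothetical: p.a and p.b are the two models' predicted probabilities for the test observations, y the observed binary outcome; an uncensored outcome vector should be accepted in place of a Surv object):
library(Hmisc)
rcorrp.cens(p.a, p.b, y)  # tests which set of predictions is more concordant with y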
Re: [R] R-squared in Logistic Regression
Johan Stenberg wrote: Dear all, How do I make R show the R-squared (deviance explained by the model) in a logistic regression? Below is how I write my syntax. Basically I want to investigate density-dependence in parasitism of larvae. Note that in the end I perform an F-test because the dispersion factor (residual deviance / residual df) is significantly higher than 1. But how do I make R show the R-squared? Best wishes Johan The proportion of deviance explained has been shown to not be such a good measure. You can use the lrm function in the Design package to get various measures including ROC area (C index), Somers' Dxy and Kendall tau rank correlation, Nagelkerke generalization of R-squared for maximum likelihood-based models (related to Maddala and others). Frank Harrell y <- cbind(para, unpara) model <- glm(y ~ log(larvae), binomial) summary(model) Call: glm(formula = y ~ log(larvae), family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -2.0633 -1.6218 -0.1871 0.7907 2.7670 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.0025 0.7049 1.422 0.15499 log(larvae) -1.0640 0.3870 -2.749 0.00597 ** (Dispersion parameter for binomial family taken to be 1) Null deviance: 35.981 on 12 degrees of freedom Residual deviance: 27.298 on 11 degrees of freedom AIC: 40.949 Number of Fisher Scoring iterations: 4 anova(model, test="F") Analysis of Deviance Table Model: binomial, link: logit Response: y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 12 35.981 log(larvae) 1 8.683 11 27.298 8.6828 0.003212 ** __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
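To see the measures Frank mentions, a sketch with a plain binary response (hypothetical data):
library(Design)
f <- lrm(y ~ log(larvae))
f$stats[c("R2", "C", "Dxy", "Tau-a")]  # Nagelkerke R2, ROC area, rank correlations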
Re: [R] fastbw question
Peter Flom wrote: Hello I am running R 2.0.1 on Windows, I am attempting to use Frank Harrell's 'fastbw' function (from the Design library), but I get an error that the fit was not created with a Design library fitting function; yet when I go to the help for fastbw (and also look in Frank's book Regression Modeling Strategies) it appears that fastbw should work with a model created with lm. Relevant code model.borrow.logols <- lm(logborrow~age + sex + racgp + yrseduc + needlchg + gallery + totni + inject + poly(year.of.int,3) + druginj + inj.years + HTLV3) fastbw(model.borrow.logols) The error message was exactly correct. lm is not a Design fitting function. Try ols. -Frank Thanks in advance Peter Peter L. Flom, PhD Assistant Director, Statistics and Data Analysis Core Center for Drug Use and HIV Research National Development and Research Institutes 71 W. 23rd St www.peterflom.com New York, NY 10010 (212) 845-4485 (voice) (917) 438-0894 (fax) __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
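So the fix is one word; a sketch of the suggested change (right-hand side shortened here):
library(Design)
model.borrow.logols <- ols(logborrow ~ age + sex + racgp + yrseduc)  # ols, not lm
fastbw(model.borrow.logols)  # now dispatches on a Design fit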
Re: [R] question
wim van baarle wrote: Sir, I found your description of the dataset about nodal involvement in prostate cancer. It comes from the book Biostatistics Casebook. I'd like to use the dataset for doing logistic regression. Can you tell me where I can find the dataset? Thanks and greetings Wim van Baarle [EMAIL PROTECTED] You didn't say which prostate cancer study that represented. One famous one has data in R form on our web site http://biostat.mc.vanderbilt.edu. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] (no answer)
I wish to perform brain surgery this afternoon at 4pm and don't know where to start. My background is the history of great statistician sports legends but I am willing to learn. I know there are courses and numerous books on brain surgery but I don't have the time for those. Please direct me to the appropriate HowTos, and be on standby for solving any problem I may encounter while in the operating room. Some of you might ask for specifics of the case, but that would require my following the posting guide and spending even more time than I am already taking to write this note. I will be out of the office from 1:15pm to 1:25pm today. This information should be valuable to many. I. Ben Fooled Technical University of Nonparametric Multivariate Statistics Slapout, Alabama __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] exclusion rules for propensity score matchng (pattern rec)
[EMAIL PROTECTED] wrote: Dear R-list, i have 6 different sets of samples. Each sample has about 5000 observations, with each observation comprised of 150 baseline covariates (X), 125 of which are dichotomous. Roughly 20% of the observations in each sample are treatment and the rest are control units. i am doing propensity score matching, i have already estimated propensity scores (predicted probabilities) using logistic regression, and in each sample i am going to have to exclude approximately 100 treated observations for which I cannot find matching control observations (because the scores for these treated units are outside the support of the scores for control units). in each sample, i must identify an exclusion rule that is interpretable on the scale of the X's that excludes these unmatchable treated observations and excludes as FEW of the remaining treated observations as possible. (the reason is that i want to be able to explain, in terms of the Xs, who the individuals are that I am making causal inference about.) i've tried some simple stuff over the past few days and nothing's worked. is there an R-package or algorithm, or even estimation strategy that anyone could recommend? (i am really hoping so!) thank you, alexis diamond Exclusion can be based on the non-overlap regions from the propensity. It should not be done in the individual covariate space. I tend to look at the 10th smallest and largest values of propensity for each of the two treatment groups for making the decision. You will need to exclude non-overlap regions whether you use matching or covariate adjustment of propensity, but covariate adjustment (using e.g. regression splines in the logit of propensity) is often a better approach once you've been careful about non-overlap. Frank Harrell __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
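A sketch of that rule of thumb (hypothetical names: ps = estimated propensity score, tx = 1 for treated, 0 for control):
lo <- max(sort(ps[tx == 0])[10], sort(ps[tx == 1])[10])
hi <- min(sort(ps[tx == 0], decreasing = TRUE)[10],
          sort(ps[tx == 1], decreasing = TRUE)[10])
keep <- ps >= lo & ps <= hi  # restrict the analysis to the overlap region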
Re: [R] exclusion rules for propensity score matchng (pattern rec)
Alexis J. Diamond wrote: hi, thanks for the reply to my query about exclusion rules for propensity score matching. Exclusion can be based on the non-overlap regions from the propensity. It should not be done in the individual covariate space. i want a rule inspired by non-overlap in propensity score space, but that binds in the space of the Xs. because i don't really know how to interpret the fact that i've excluded, say, people with scores above .87, but i DO know what it means to say that i've excluded people from country XYZ over age Q because i can't find good matches for them. if i make my rule based on Xs, i know who i can and cannot make inference for, and i can explain to other people who are the units that i can and cannot make inference for. after posting to the list last night, i thought of using the RGENOUD package (genetic algorithm) to search over the space of exclusion rules (e.g., var 1 = 1, var 2 = 0, var 3 = 1 or 0, var 4 = 0); the loss function associated with a rule should be increasing in # of tr units w/ support excluded and decreasing in # of tr units w/out support excluded. it might be tricky to get the right loss function, and i know this idea is kind of nutty, but it's the only automated search method i could think of. any comments? alexis Using the X space directly will not result in optimum exclusions unless you use a distance function, but that will make assumptions. My advice is to use rpart to make a classification rule that approximates the exclusion criteria to some desired degree of accuracy. I.e., use rpart to predict the propensity lower cutoff and separately to predict the propensity upper cutoff. This just assists in interpretation. Frank I tend to look at the 10th smallest and largest values of propensity for each of the two treatment groups for making the decision. You will need to exclude non-overlap regions whether you use matching or covariate adjustment of propensity but covariate adjustment (using e.g. regression splines in the logit of propensity) is often a better approach once you've been careful about non-overlap. Frank Harrell On Tue, 5 Apr 2005, Frank E Harrell Jr wrote: [EMAIL PROTECTED] wrote: Dear R-list, i have 6 different sets of samples. Each sample has about 5000 observations, with each observation comprised of 150 baseline covariates (X), 125 of which are dichotomous. Roughly 20% of the observations in each sample are treatment and the rest are control units. i am doing propensity score matching, i have already estimated propensity scores (predicted probabilities) using logistic regression, and in each sample i am going to have to exclude approximately 100 treated observations for which I cannot find matching control observations (because the scores for these treated units are outside the support of the scores for control units). in each sample, i must identify an exclusion rule that is interpretable on the scale of the X's that excludes these unmatchable treated observations and excludes as FEW of the remaining treated observations as possible. (the reason is that i want to be able to explain, in terms of the Xs, who the individuals are that I am making causal inference about.) i've tried some simple stuff over the past few days and nothing's worked. is there an R-package or algorithm, or even estimation strategy that anyone could recommend? (i am really hoping so!)
thank you, alexis diamond -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
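A sketch of the rpart suggestion above (hypothetical names: covs = data frame of the Xs, excluded = logical flag derived from the propensity overlap rule):
library(rpart)
covs$excluded <- factor(excluded)
fit <- rpart(excluded ~ ., data = covs, method = "class")
plot(fit); text(fit)  # the splits describe, in terms of the Xs, who is excluded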
Re: [R] rcorrp.cens
Stefano Mazzuco wrote: Hi R-users, I'm having some problems in using the Hmisc package. I'm estimating a Cox PH model and want to test whether the drop in concordance index due to omitting one covariate is significant. I think (but I'm not sure) there are two ways to do that: 1) predict two Cox models (the full model and the model without the covariate of interest) and estimate the concordance index (i.e. area under the ROC curve) with rcorr.cens for both models, then compute the difference 2) predict the two Cox models and estimate directly the difference between the two c-indices using rcorrp.cens. But it seems that rcorrp.cens gives me the drop in the Dxy index. Do you have any hint? Thanks Stefano First of all, any method based on comparing rank concordances loses power and is discouraged. Likelihood ratio tests (e.g., by embedding a smaller model in a bigger one) are much more powerful. If you must base comparisons on rank concordance (e.g., ROC area=C, Dxy) then rcorrp.cens can work if the sample size is large enough so that uncertainty about regression coefficient estimates may be ignored. rcorrp.cens doesn't give the drop in C; it gives the probability that one model is more concordant with the outcome than another, among pairs of paired predictions. The bootcov function in the Design package has a new version that will output bootstrap replicates of C for a model, and its help file tells you how to use that to compare C for two models. This should only be done to show how low a power such a procedure has. rcorrp.cens is likely to be more powerful than that, but likelihood ratio is what you want. You will find many cases where one model increases C by only 0.02 but has many more useful (more extreme) predictions. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Hmisc + summarize + quantile: Why only quantiles for first variable in data frame?
Kim Milferstedt wrote: Hi Frank Harrell, thanks for the response. I understand your comment but I wasn't able to find (or recognize) an answer on how to tell FUN explicitly to use matrix operations. Would you be so kind and give me an example? Thanks so much, Kim See http://biostat.mc.vanderbilt.edu/SasByMeansExample plus an example in the help file for summary.formula in Hmisc which uses the apply function. summary.formula and summarize are similar in the use of FUN (which summary.formula unfortunately calls 'fun'). Frank Please read the documentation and see the examples. The first argument to summarize is a matrix or vector and if a matrix, FUN must use matrix operations if you want column-by-column results. FH -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University Kim Milferstedt wrote: Hi, I'm working on a data set that contains a couple of factors and a number of dependent variables. From all of these dependent variables I would like to calculate the mean, standard deviation and quantiles. With the function FUN I get all the means and stdevs that I want, but quantiles are only calculated for the first of the dependent variables (column 8 in the summarize command). What do I have to do differently in order to get all the quantiles that I want? Thanks, Kim sgldm2 <- read.table("E:/analysistemp/060412_test_data2.txt", header=T) attach(sgldm2) names(sgldm2) FUN <- function(x) c(Mean=mean(x,na.rm=TRUE), STDEV=sd(x,na.rm=TRUE), Quantile=quantile(x, probs=c(0.25,0.50,0.75), na.rm=TRUE)) ordering <- llist(time_h_f, Distance_f) resALL <- summarize(sgldm2[,8:10], ordering, FUN) __ Kim Milferstedt University of Illinois at Urbana-Champaign Department of Civil and Environmental Engineering 4125 Newmark Civil Engineering Building 205 North Mathews Avenue MC 250 Urbana, IL 61801 USA phone: (001) 217 333-9663 fax: (001) 217 333-6968 email: [EMAIL PROTECTED] http://cee.uiuc.edu/research/morgenroth/index.asp ___ -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
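The key point is that the summary function must compute each statistic column by column, e.g. with apply; a self-contained illustration of the idea (made-up data):
m <- cbind(a = rnorm(20), b = rnorm(20), c = rnorm(20))
apply(m, 2, function(w)
  c(Mean = mean(w, na.rm = TRUE),
    STDEV = sd(w, na.rm = TRUE),
    quantile(w, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)))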
Re: [R] Creat new column based on condition
Duncan Murdoch wrote: On 4/21/2006 4:05 PM, Sachin J wrote: Hi, How can I accomplish this task in R? V1 10 20 30 10 10 20 Create a new column V2 such that: If V1 = 10 then V2 = 4 If V1 = 20 then V2 = 6 If V1 = 30 then V2 = 10 Gabor's solution is fine; something that looks a little bit more like your code is this: V2 <- NA V2 <- ifelse( V1 == 10, 4, V2) V2 <- ifelse( V1 == 20, 6, V2) V2 <- ifelse( V1 == 30, 10, V2) or V2 <- 4*(V1==10) + 6*(V1==20) + 10*(V1==30) V2[V2==0] <- NA Frank or V2 <- ifelse( V1 == 10, 4, ifelse( V1 == 20, 6, ifelse( V1 == 30, 10, NA ))) (where the NA is to handle any unexpected case where V1 isn't 10, 20 or 30). My preference would be to use just one assignment, and if I was sure 10, 20 and 30 were the only possibilities, would use V2 <- ifelse( V1 == 10, 4, ifelse( V1 == 20, 6, 10 )) Duncan Murdoch So the output looks like this V1 V2 10 4 20 6 30 10 10 4 10 4 20 6 Thanks in advance. Sachin -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
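Another compact alternative, not in the thread above, is a named lookup vector; values of V1 outside the table come out as NA automatically:
map <- c("10" = 4, "20" = 6, "30" = 10)
V2 <- unname(map[as.character(V1)])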
Re: [R] regression modeling
Berton Gunter wrote: May I offer a perhaps contrary perspective on this. Statistical **theory** tells us that the precision of estimates improves as sample size increases. However, in practice, this is not always the case. The reason is that it can take time to collect that extra data, and things change over time. So the very definition of what one is measuring, the measurement technology by which it is measured (think about estimating tumor size or disease incidence or underemployment, for example), the presence or absence of known or unknown large systematic effects, and so forth may change in unknown ways. This defeats, or at least complicates, the fundamental assumption that one is sampling from a (fixed) population or stable (e.g. homogeneous, stationary) process, so it's no wonder that all statistical bets are off. Of course, sometimes the necessary information to account for these issues is present, and appropriate (but often complex) statistical analyses can be performed. But not always. Thus, I am suspicious, cynical even, about those who advocate collecting all the data and subjecting the whole vast heterogeneous mess to arcane and ever more computer intensive (and adjustable parameter ridden) data mining algorithms to detect trends or discover knowledge. To me, it sounds like a prescription for turning on all the equipment and waiting to see what happens in the science lab instead of performing careful, well-designed experiments. I realize, of course, that there are many perfectly legitimate areas of scientific research, from geophysics to evolutionary biology to sociology, where one cannot (easily) perform planned experiments. But my point is that good science demands that in all circumstances, and especially when one accumulates and attempts to aggregate data taken over spans of time and space, one needs to beware of oversimplification, including statistical oversimplification. So interrogate the measurement, be skeptical of stability, expect inconsistency. While "all models are wrong but some are useful" (George Box), the second law tells us that entropy still rules. (Needless to say, public or private contrary views are welcome). -- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA Bert raises some great points. Ignoring the important issues of doing good research and stability in the meaning of data as time marches on, it is generally true that the larger the sample size the greater the complexity of the model we can afford to fit, and the better the fit of the model. This is the AIC school. The BIC school assumes there is an actual model out there waiting for us, of finite dimension, and the complexity of our models should not grow very fast as N increases. I find the AIC approach gives me more accurate predictions. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] stepwise regression
Jinsong Zhao wrote: Dear all, I have encountered a problem when performing stepwise regression. You have more problems than you know. The dataset has 9 independent variables, but only 7 observations. Why collect any data? You can get great fits using random numbers with this procedure. Frank In R, before performing stepwise, an lm object should be given. fm <- lm(y ~ X1 + X2 + X3 + X11 + X22 + X33 + X12 + X13 + X23) However, summary(fm) will give: Residual standard error: NaN on 0 degrees of freedom Multiple R-Squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 6 and 0 DF, p-value: NA In this situation, step() or stepAIC() will not give any useful information. I don't know why SAS can deal with this situation: PROC REG; MODEL y=X1 X2 X3 X11 X22 X33 X12 X13 X23/SELECTION=STEPWISE; RUN; Any help will be really appreciated. Wishes, Jinsong Zhao -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] function for linear regression with White std. errors
Thomas Lumley wrote: On Thu, 27 Apr 2006, Brian Quinif wrote: John, Thanks for the suggestion, but tomorrow I am teaching a little seminar for my department trying to convince people about how wonderful R is. These people are all Stata users, and they really like the idea that they only have to type , robust to get het.-consistent std. errors. This really is a unique feature of Stata. It's not hard to add the standard errors to glm(), but it is harder to make other functions such as predict() use them properly. The survey package gives these standard errors by default, but that might also be regarded as too much work. The Design package also does this: f <- fittingfunction( ) g <- robcov(f, ...) plot(g); summary(g) etc. use robust s.e. Frank -thomas Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
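A minimal sketch of the robcov route (hypothetical data; x = TRUE, y = TRUE are needed so robcov can reach the design matrix and residuals):
library(Design)
dd <- datadist(dat); options(datadist = 'dd')
f <- ols(y ~ x1 + x2, data = dat, x = TRUE, y = TRUE)
g <- robcov(f)  # Huber-White sandwich covariance estimates
anova(g)        # Wald tests now use the robust standard errors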
Re: [R] New-user support package - suggestions?
Søren Højsgaard wrote: Dear Andrew, I tend to agree with Uwe. A perhaps more useful approach for achieving your goal would be to create a video introduction to R. On http://www.learnvisualstudio.net/ you can find examples of such introductory videos for programming, for example in C#. I've experimented a little myself with creating such videos for getting started with R, i.e. installing R, installing the Tinn-R editor, using the help facilities, finding things on CRAN. It is not as easy to do as I had thought... (If anyone is interested in the result, please drop me a line and I'll send you a link.) There are various relevant pieces of software around, for example camtasia (commercial) and camstudio (freeware). Such a set of videos could make it onto CRAN... Best Søren But please choose an editor that works on all platforms. Frank From: [EMAIL PROTECTED] on behalf of Uwe Ligges Sent: Thu 04-05-2006 09:13 To: Andrew Robinson Cc: R-Help Discussion Subject: Re: [R] New-user support package - suggestions? Andrew Robinson wrote: Dear Community, This is largely a repost with some new information. I'm interested in developing a package that could ease the command-line learning curve for new users. It would provide more detailed syntax checking and commentary as feedback. It would try to anticipate common new-user errors, and provide feedback to help correct them. Re. anticipate, see install.packages("fortunes") library(fortunes) fortune("anticipate") As a trivial example, instead of mean(c(1,2,NA)) [1] NA we might have mean(c(1,2,NA)) [1] NA Warning: your data contains missing values. If you wish to ignore these for the purposes of computing the mean, use the argument: na.rm=TRUE. Attention. If you give help like this, you will implicitly teach users not to read the manuals but to trust R's warning/error messages and suggestions. Some students won't ask the question "why do my data contain NAs" but will start using na.rm=TRUE, even if the error was in a prior step and no NAs were expected at all. I'm interested in any thoughts that people have about this idea - what errors do you commonly see, and how can they be dealt with? I have funding for 6 weeks of programming support for this idea. All suggestions are welcome. Cheers, Andrew Your project sounds very ambitious. I anticipate you will be disappointed after 6 weeks of programming, because you won't have achieved very much. I'd rather spend 6 weeks of time on some more promising projects... Just my first thoughts, just go on with your project if you are convinced. Best, Uwe Ligges __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] data manipulation docs
Federico Calboli wrote: Hi All, Is there some document/manual about data manipulation within R that I could use as a reference (obviously, aside from the R manuals)? The reason I am asking is that I have a number of data frames/matrices containing genetic data. The data is in character form, as in: V1 V2 V3 V4 V5 1 AA AG AA GG AG 2 AC AA AA GG AG 3 AA AG AA GG AG 4 AA AA AA GG AG 5 AA AA AA GG AA I need to chop, subset, and variously manipulate this kind of data, sometimes keeping the data in its character format, sometimes converting it to numeric form (i.e. substituting each data point with the equivalent factor value). Since the data is often quite big, I have to keep things memory efficient. This whole game is getting exceedingly time consuming and frustrating, because I end up with random pieces of code that I save, patching a particular problem, but difficult to 'abstract' for a new task, so I get back close to square one annoyingly often. Cheers, Federico Calboli There is a large data manipulation section in the Alzola and Harrell document available on CRAN under contributed docs, or a slightly more up-to-date version at biostat.mc.vanderbilt.edu -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] devide data into decile
Uwe Ligges wrote: Guojun Zhu wrote: I guess this is really basic. But I have not found an answer yet. I have a big data.frame. I would like to divide it into 10 deciles according to one of its members. Then I need a number for each decile with some computation within each group. How do I divide it? For example, add the result of cut() as a new variable to the data.frame and afterwards split() the data.frame by the resulting factor. Or library(Hmisc) xg <- cut2(x, g=10)  # labels deciles nicely Frank Uwe Ligges -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
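A short self-contained example of the cut2 approach (made-up data):
library(Hmisc)
set.seed(2)
x <- rnorm(100)
xg <- cut2(x, g = 10)  # factor with ten roughly equal-sized groups
tapply(x, xg, mean)    # a per-decile computation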
Re: [R] lmer, p-values and all that
Douglas Bates wrote: Users are often surprised and alarmed that the summary of a linear . . . . Doug, I have been needing this kind of explanation. That is very helpful. Thank you. I do a lot with penalized MLEs for ordinary regression and logistic models and know that getting sensible P-values is not straightforward even in that far simpler situation. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] odd feature
ivo welch wrote: Dear R wizards: I just got stung by the ifelse() feature. a <- 10:15 b <- 20:300 test <- 1 ifelse(test,a,b) [1] 10 I had not realized that this was the default behavior---I had expected 10:15. mea culpa. however, I wonder whether it would make sense to replace ifelse with a different semantic, where if test is a single scalar, it means what a stupid user like me would imagine. Aside, I like the flexibility of R, but I am not thrilled by all the recycling rules. I either mean I want a scalar or a vector of equal/appropriate dimension. I never want a recycle of a smaller vector. (I do often use a recycle of a scalar.) regards, /iaw The current behavior is logical to me. Are you looking for if(test) a else b? Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Can lmer() fit a multilevel model embedded in a regression?
I would like to fit a hierarchical regression model from Witte et al. (1994; see reference below). It's a logistic regression of a health outcome on quantities of food intake; the linear predictor has the form X*beta + W*gamma, where X is a matrix of consumption of 82 foods (i.e., the rows of X represent people in the study, the columns represent different foods, and X_ij is the amount of food j eaten by person i); and W is a matrix of some other predictors (sex, age, ...). The second stage of the model is a regression of X on some food-level predictors. Is it possible to fit this model in (the current version of) lmer()? The challenge is that the persons are _not_ nested within food items, so it is not a simple multilevel structure. We're planning to write a Gibbs sampler and fit the model directly, but it would be convenient to be able to fit it in lmer() as well to check. Andrew --- Reference: Witte, J. S., Greenland, S., Hale, R. W., and Bird, C. L. (1994). Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 5, 612-621. -- Andrew Gelman Professor, Department of Statistics Professor, Department of Political Science [EMAIL PROTECTED] www.stat.columbia.edu/~gelman Statistics department office: Social Work Bldg (Amsterdam Ave at 122 St), Room 1016 212-851-2142 Political Science department office: International Affairs Bldg (Amsterdam Ave at 118 St), Room 731 212-854-7075 Mailing address: 1255 Amsterdam Ave, Room 1016 Columbia University New York, NY 10027-5904 212-851-2142 (fax) 212-851-2164 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Can lmer() fit a multilevel model embedded in a regression?
Andrew Gelman wrote: Frank, Just to check: are you saying that this model can be fit in lmer()? Thanks. Andrew Not sure about that. Greenland probably used BUGS, but that doesn't mean you can't do it with lmer. Frank Frank E Harrell Jr wrote: A great reference for that kind of model, plus a way to 'connect' foods via a composition matrix, is author = {Greenland, Sander}, title = {When should epidemiologic regressions use random coefficients?}, journal = Biometrics, year = 2000, volume = 56, pages = {915-921}, -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] odd feature
Duncan Murdoch wrote: On 5/22/2006 3:26 AM, Martin Maechler wrote: Gabor == Gabor Grothendieck [EMAIL PROTECTED] on Sun, 21 May 2006 09:47:07 -0400 writes: Gabor If you know that test is a scalar Gabor result <- if (test) a else b Gabor will do it. Yes, indeed! IMO, ifelse(test, a, b) is much overused whereas if(test) a else b is much UNDER used. From some e-mail postings, and even some documents (even printed books?), I get the impression that too many people think that ifelse(.,.,.) is to be used as an expression / function and if(.) . else . only for program flow control. This leads to quite suboptimal code, and I personally use if(.) . else . __as expression__ much more frequently than ifelse(.,.,.) For overuse of ifelse(), do you mean cases where test is length 1, so if() would work? Or are you thinking of something else? I'd also be interested in what you mean by quite suboptimal code. Are you thinking of things like if (test) temp <- a else temp <- b result <- f(temp) versus result <- f( if (test) a else b ) ? I would generally use the former, because it's easier to get the formatting right, and I find it easier to read. It's suboptimal in speed and memory use because of creating the temp variable, but in most cases I think that would be such a small difference that the small increase in readability is worthwhile. IMHO that approach is too verbose and not more readable. Frank Duncan Murdoch Martin Maechler, ETH Zurich. Gabor Here is another approach: Gabor as.vector(test * ts(a) + (!test) * ts(b)) Gabor On 5/21/06, ivo welch [EMAIL PROTECTED] wrote: Dear R wizards: I just got stung by the ifelse() feature. a <- 10:15 b <- 20:300 test <- 1 ifelse(test,a,b) [1] 10 I had not realized that this was the default behavior---I had expected 10:15. mea culpa. however, I wonder whether it would make sense to replace ifelse with a different semantic, where if test is a single scalar, it means what a stupid user like me would imagine. Aside, I like the flexibility of R, but I am not thrilled by all the recycling rules. I either mean I want a scalar or a vector of equal/appropriate dimension. I never want a recycle of a smaller vector. (I do often use a recycle of a scalar.) regards, /iaw __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
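The contrast in one self-contained example:
test <- TRUE
a <- 10:15
b <- 20:30
ifelse(test, a, b)  # result has the length of test, so this returns only 10
if (test) a else b  # returns the chosen branch whole: 10 11 12 13 14 15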
Re: [R] Ordinal Independent Variables
Rick Bilonick wrote: When I run lrm from the Design package, I get a warning about contrasts when I include an ordinal variable: Warning message: Variable ordfac is an ordered factor. You should set options(contrasts=c("contr.treatment","contr.treatment")) or Design will not work properly. in: Design(eval(m, sys.parent())) I don't get this message if I use glm with family=binomial. It produces linear and quadratic contrasts. If it's improper to do this for an ordinal variable, why does glm not balk? Rick B. Standard regression methods don't make good use of ordinal predictors and just have to treat them as categorical. Design is a bit picky about this. If the predictor has numeric scores for the categories, you can get a test of adequacy of the scores (with k-2 d.f. for k categories) by using scored(predictor) in the formula. Or just create a factor( ) variable to hand to Design. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
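A sketch of the scored() approach (hypothetical data; ordfac an ordered factor whose levels carry numeric scores):
library(Design)
options(contrasts = c("contr.treatment", "contr.treatment"))
dd <- datadist(dat); options(datadist = 'dd')
f <- lrm(y ~ scored(ordfac), data = dat)
anova(f)  # the nonlinear component tests adequacy of the scores (k-2 d.f.)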
Re: [R] Ordinal Independent Variables
Prof Brian Ripley wrote: On Mon, 22 May 2006, Frank E Harrell Jr wrote: Rick Bilonick wrote: When I run lrm from the Design package, I get a warning about contrasts when I include an ordinal variable: Warning message: Variable ordfac is an ordered factor. You should set options(contrasts=c("contr.treatment","contr.treatment")) or Design will not work properly. in: Design(eval(m, sys.parent())) I don't get this message if I use glm with family=binomial. It produces linear and quadratic contrasts. If it's improper to do this for an ordinal variable, why does glm not balk? Rick B. Standard regression methods don't make good use of ordinal predictors and just have to treat them as categorical. Design is a bit picky about this. If the predictor has numeric scores for the categories, you can get a test of adequacy of the scores (with k-2 d.f. for k categories) by using scored(predictor) in the formula. Or just create a factor( ) variable to hand to Design. Contrasts in S/R are used to set the coding of factors, and model.matrix() does IMO 'make good use of ordinal predictors'. I don't know what is meant by 'Standard regression methods': the charitable interpretation is that these are the overly restrictive methods used by certain statistical packages. (I first learnt of the use of polynomial codings for ordinal factors in the late 1970s, when I first learnt anything about ANOVA, so to me they are 'standard'.) So are you saying this is a design deficiency in package Design, or that the authors of S ca 1991 were wrong to allow arbitrary contrasts? Brian, What I meant was that unlike the case of ordinal response variables, where multiple intercepts in logistic models do not cost degrees of freedom because the ordering constraint is fully utilized, ordinal predictors require k-1 degrees of freedom for k levels using any standard contrast. Special methods (e.g. pool adjacent violators to impose a monotonicity constraint) would have to be used to get a lot out of the ordinal nature of the predictor. There's nothing wrong with allowing arbitrary contrasts; more progress has been made in statistics for ordinal responses than ordinal predictors. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] position of number at risk in survplot() graphs
Osman Al-Radi wrote: Dear R-help How can one get survplot() to place the number at risk just below the survival curve as opposed to the default, which is just above the x-axis? I tried the code below but the result is not satisfactory, as some numbers are repeated several times at different y coordinates and the position of the n.risk numbers corresponds to the x-axis tick marks, not the survival curve time of censoring. n <- 20 set.seed(731) cens <- 15*runif(n) h <- .02*exp(2) dt <- -log(runif(n))/h label(dt) <- 'Follow-up Time' e <- ifelse(dt <= cens, 1, 0) dt <- pmin(dt, cens) units(dt) <- 'Year' S <- Surv(dt,e) km <- survfit(S ~ 1) survplot(km, n.risk=T, conf='none', y.n.risk=unique(summary(km)$surv)) Any suggestion on addressing this problem would be appreciated. Also, is there a way to add a tick mark to the survival curve at times of censoring, similar to the mark.time=T argument in plot.survfit()? Thanks Osman Osman, y.n.risk has to be a scalar and gives the y-coordinate of the bottom line of numbers at risk. I take it that you want the numbers not all at the same height. This will require a customization of survplot or fetching information from the km object and using text( ). Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
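A sketch of the text( ) customization Frank suggests, building on the km object above (the vertical offset of 0.05 is arbitrary):
s <- summary(km)                       # survival and n.risk at event times
survplot(km, conf = 'none')
text(s$time, s$surv - 0.05, s$n.risk)  # place the numbers just below the curve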
Re: [R] SPSS variable lables import
Thomas Lumley wrote: On Tue, 6 Jun 2006, Frank Thomas wrote: Hi, I try to get the variable labels of an SPSS data file into R but don't find this mentioned in the help file for foreign. Is there another way to get them? BTW: An SPSS variable name is like VAR001, whereas the variable label might be 'Identification no.' mydata <- read.spss("mydata.sav") attr(mydata, "variable.labels") -thomas Or: library(Hmisc) mydata <- spss.get('mydata.sav') describe(mydata) contents(mydata) xYplot( ) etc. use the variable labels attached to individual variables. You can retrieve them also this way: with(mydata, plot(x, y, xlab=label(x), ylab=label(y))) Frank Harrell Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] modeling logit(y/n) using lrm
Hamilton, Cody wrote: After a little digging, it turns out that fit.mult.impute will allow fitter = glm, so previous suggestions regarding modeling cbind(y,n) as an outcome will work fine. Thanks! Also lrm can easily handle your setup using the weights argument. Frank Cody Hamilton, Ph.D Institute for Health Care Research and Improvement Baylor Health Care System (214) 265-3618 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
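A sketch of the weights device, expanding y events out of n trials into one success row and one failure row per group (hypothetical names; dat has columns y, n, and a predictor x):
library(Design)
long <- with(dat, data.frame(x = rep(x, 2),
                             out = rep(1:0, each = nrow(dat)),
                             w = c(y, n - y)))
f <- lrm(out ~ x, weights = w, data = long)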
[R] useR! Thanks
After attending my first useR! conference I want to thank the organizers for doing a wonderful job and the presenters for their high quality presentations and stimulating ideas. The conference venue was excellent and of course Vienna is one of the greatest cities in the world to visit. useR! is one of the most fun conferences I've attended. Thanks again! -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] latex function with lm
Erin Hodgess wrote: Dear R People: I have used the latex function from the Hmisc package and it is just great! However, I have a new question regarding that function: is there an option for summary(lm(y~x)), please? There are options for different types of objects, but I didn't see one for that. Maybe I just missed it. There is no latex method for summary(lm); contributions welcomed. You might also look at latex.ols, latex.anova.Design, latex.summary.Design. Frank Thanks in advance! R for Windows Version 2.3.1 Sincerely, Erin Hodgess Associate Professor Department of Computer and Mathematical Sciences University of Houston - Downtown mailto: [EMAIL PROTECTED] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
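Until such a method exists, latex's default method will render the coefficient matrix of a summary directly; a sketch (hypothetical data):
library(Hmisc)
f <- lm(y ~ x, data = dat)
latex(round(summary(f)$coefficients, 3), file = '')  # latex.default handles matrices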
Re: [R] effect size
Matthew Bridgman wrote: Does anyone know a simple way of calculating effect sizes? Thanks MB Yes - the following formula is simple and fairly universal: 2 :-) -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] As.Factor with Logistic Regression
Justin Rapp wrote: I am modeling the probability of a player succeeding in the NFL with a binomial logistic regression, with 1 signifying success and 0 signifying no success. I performed the regression of the binomial variable against overall draft position, using the college conference in which each player played as a factor via the as.factor(Conference) command. My question is: how do I plot specific factors against the curve for the entire set? There are only a few factors that have significant coefficients and I would like to plot those against the logistic curve for the entire set. Any help would be greatly appreciated. It will bias the analysis to omit insignificant factors. To get the plots you want: library(Hmisc); library(Design) dd <- datadist(mydata); options(datadist='dd') f <- lrm(y ~ x1 + x2 + ...) plot(f, x1=NA, fun=plogis) plot(f, x2=NA, fun=plogis) plot(f, fun=plogis) # plot partial effects of all predictors, probability scale -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R Reporting - PDF/HTML mature presentation quality package?
zubin wrote: Hello, searching for the key packages so I can output text, tables, and graphics into an HTML or PDF report - have R create these reports in an easy and efficient way without LaTeX - I have searched the R pages but don't see any mature packages - does anyone have any advice on an easy-to-use R package for generating publication-quality reports, outputting HTML or PDF? Doing this without LaTeX is like doing statistical analysis without linear models and the Wilcoxon test. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html