[R] X11 font
I get the following error out of R, on a newer Ubuntu installation. Error in `axis()`: ! X11 font -adobe-helvetica-%s-%s-*-*-%d-*-*-*-*-*-*-*, face 1 at size 12 could not be loaded Backtrace: 1. graphics::matplot(...) 3. graphics::plot.default(...) 4. graphics (local) localAxis(...) 6. graphics:::Axis.default(...) 7. graphics::axis(side = side, at = at, labels = labels, ...) The context is pretty simple: library(survival) matplot(60:100, 36525* survexp.us[61:101, 1:2, "2014"], col=2:1, lty=1, lwd=2, xlab="Age", ylab="Death rate per 100", log='y', type='l', yaxt='n', main="US Death Rates") axis(2, c(1,2,5,10,20, 50), c(1,2,5,10, 20, 50), las=2) This code works fine on my screen. The error comes up when I put it into an .Rmd file and apply rmarkdown::render() to it. Likely some font file is needed, but what one? Terry Version: R Under development (unstable) (2023-08-01 r84818) -- "Unsuffered Consequences" Copyright (C) 2023 The R Foundation for Statistical Computing Platform: aarch64-unknown-linux-gnu -- Terry M Therneau, PhD Department of Quantitative Health Sciences Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] extract from a list of lists
Thanks everyone for prodding my gray matter, which seems to be getting stiffer as I approach 70 (< 90 days). -- I'll continue to use the $ or [[ forms. That will suffice. -- I thought there might be a base R variant, e.g. something like extract( list, element-name); probably cross talk in my brain from the rstan library -- Gregg's note shows such a function in purr. But I rather like having as few dependencies as possible in a package, one usage is normally not enough, at least for something this simple. Terry T. On 12/27/22 14:38, Bert Gunter wrote: > Well, I prefer Greg's approach, but if you want to avoid calls to $ or > `[[` then you could do: > > unlist(fits)[ rep(names(fits[[1]]) == 'iter', length(fits))] > > > Cheers, > Bert > > On Tue, Dec 27, 2022 at 9:46 AM Greg Snow<538...@gmail.com> wrote: >> Another option is the map family of functions in the purrr package >> (yes, this depends on another package being loaded, which may affect >> things if you are including this in your own package, creating a >> dependency). >> >> In map and friends, if the "function" is a string or integer, then it >> is taken as the piece to be extracted, so you should be able to do >> something like: >> >> library(purrr) >> map(fits, 'iter') >> # or >> map_int(fits, 'iter') >> # or >> map_dbl(fits, 'iter') >> >> which of the last 2 to use depends on how `iter` is stored. >> >> On Tue, Dec 27, 2022 at 10:16 AM Therneau, Terry M., Ph.D. via R-help >> wrote: >>> I not uncommonly have the following paradym >>> fits <- lapply(argument, function) >>> >>> resulting in a list of function results. Often, the outer call is to >>> mclapply, and the >>> function encodes some long calculation, e.g. multiple chains in an MCMC. >>> Assume for illustration that each function returns a list with elements >>> beta, loglik, iter. >>> >>> Then sapply(fits, function(x) x$iter) >>> will give me a vector, with the number of iterations used by each instance. >>> >>> I've often been suspicious that there is some simple shorthand for the >>> "grab all the >>> elements named iter" that skips the explicit x$iter function. Am I indeed >>> overlooking >>> something?I don't expect a speed increase, just cleaner code. >>> >>> Terry T. >>> >>> -- >>> Terry M Therneau, PhD >>> Department of Quantitative Health Sciences >>> Mayo Clinic >>> thern...@mayo.edu >>> >>> "TERR-ree THUR-noh" >>> >>> [[alternative HTML version deleted]] >>> >>> __ >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help=05%7C01%7Ctherneau%40mayo.edu%7Cebbda887f4b94223376608dae84a58ae%7Ca25fff9c3f634fb29a8ad9bdd0321f9a%7C0%7C0%7C638077703266213025%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=wksTglkeQKmMSWWKHLsUijH5A25cH%2FuwxSNBOeNr9Sg%3D=0 >>> PLEASE do read the posting >>> guidehttps://nam12.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html=05%7C01%7Ctherneau%40mayo.edu%7Cebbda887f4b94223376608dae84a58ae%7Ca25fff9c3f634fb29a8ad9bdd0321f9a%7C0%7C0%7C638077703266213025%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=9GASHCZVgnzNqAUrWimS7UphiqjYLOf50M1qWv6jeN0%3D=0 >>> and provide commented, minimal, self-contained, reproducible code. >> >> >> -- >> Gregory (Greg) L. Snow Ph.D. >> 538...@gmail.com >> >> __ >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-help=05%7C01%7Ctherneau%40mayo.edu%7Cebbda887f4b94223376608dae84a58ae%7Ca25fff9c3f634fb29a8ad9bdd0321f9a%7C0%7C0%7C638077703266213025%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=wksTglkeQKmMSWWKHLsUijH5A25cH%2FuwxSNBOeNr9Sg%3D=0 >> PLEASE do read the posting >> guidehttps://nam12.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.r-project.org%2Fposting-guide.html=05%7C01%7Ctherneau%40mayo.edu%7Cebbda887f4b94223376608dae84a58ae%7Ca25fff9c3f634fb29a8ad9bdd0321f9a%7C0%7C0%7C638077703266213025%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C=9GASHCZVgnzNqAUrWimS7UphiqjYLOf50M1qWv6jeN0%3D=0 >> and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] MCMC tools
Is there a convenient package that computes standard covergence summaries for and MCMC run? This is something that I likely knew once and have now forgotton. More detail: I'm trying to understand the MCMC done by a particular model called Subtype and Stage Inference (SuStaIn), suffice it to say that I am cautious of some details. The model doesn't fit into the standard packages, so I've set it up and run my own Metropolis chains. I don't want to expend energy also creating R-hat, ESS, and other sensible summaries; even more so to find the inevitable programming mistakes I'll make if I create them myself. For the truly curious. I have measurments of tau depostion for 86 brain regions (43 * left/right) of interest from 1925 scans. One hypothesis in dementia research is that tau deposition occurs over time, in a pattern; there is likely more than one pattern. The algorithm is looking a permutations of the 86 regions, in search of a small collection that best fits all the subjects. There is a general background of statistical work that has shown that ranking is a hard problem, in the sense of having a small variance for individual ranks. SuStaIn tends to give small variances. Terry T. -- Terry M Therneau, PhD Department of Quantitative Health Sciences Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] extract from a list of lists
I not uncommonly have the following paradym fits <- lapply(argument, function) resulting in a list of function results. Often, the outer call is to mclapply, and the function encodes some long calculation, e.g. multiple chains in an MCMC. Assume for illustration that each function returns a list with elements beta, loglik, iter. Then sapply(fits, function(x) x$iter) will give me a vector, with the number of iterations used by each instance. I've often been suspicious that there is some simple shorthand for the "grab all the elements named iter" that skips the explicit x$iter function. Am I indeed overlooking something? I don't expect a speed increase, just cleaner code. Terry T. -- Terry M Therneau, PhD Department of Quantitative Health Sciences Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to find "first" or "last" record after sort in R
I prefer the duplicated() function, since the final code will be clear to a future reader. (Particularly when I am that future reader). last <- !duplicated(mydata$ID, fromLast=TRUE) # point to the last ID for each subject mydata$data3[last] <- NA Terry T. (I read the list once a day in digest form, so am always a late reply.) On 9/10/21 5:00 AM, r-help-requ...@r-project.org wrote: Hello List, Please look at the sample data frame below: ID date1 date2 date3 1 2015-10-08 2015-12-17 2015-07-23 2 2016-01-16 NA 2015-10-08 3 2016-08-01 NA 2017-01-10 3 2017-01-10 NA 2016-01-16 4 2016-01-19 2016-02-24 2016-08-01 5 2016-03-01 2016-03-10 2016-01-19 This data frame was sorted by ID and date1. I need to set the column date3 as missing for the "last" record for each ID. In the sample data set, the ID 1, 2, 4 and 5 has one row only, so they can be consider as first and last records. the data3 can be set as missing. But the ID 3 has 2 rows. Since I sorted the data by ID and date1, the ID=3 and date1=2017-01-10 should be the last record only. I need to set date3=NA for this row only. the question is, how can I identify the "last" record and set it as NA in date3 column. Thank you, Kai [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxph means not equal to means of model matrix
On 9/3/21 12:59 PM, Bond, Stephen wrote: > > I looked at the nocenter and it says (-1,0,1) values but it seems that any > three-level > factor is included in that (represented as 1,2,3 in R) . > A factor is turned into a set of 0/1 dummy variable, so the nocenter applies.� I will add more clarification to the documentation. > Also, is the baseline curve now showing the reference level and not the > fictional .428 > sex? If I predict the risk for a new row, should I multiply the coefficient > shown in the > output by 1 for a sex=1? It used to be (1-.428)*coef. > Yes, the "mean" component is the reference level for predict and survfit.� If I could go back in time it would be labeled as "reference" instead of "mean".�� Another opportunity for me to make the documentation clearer. Good questions, � Terry T > Thanks for clarifying. > > SB > > *From:* Therneau, Terry M., Ph.D. > *Sent:* Friday, 3 September, 2021 12:37 > *To:* Bond, Stephen > *Cc:* R-help > *Subject:* Re: coxph means not equal to means of model matrix > > [EXTERNAL] > > -- > > See ?coxph, in particular the new "nocenter" option. > > Basically, the "mean" component is used to center later computations.� This > can be > critical for continuous variables, avoiding overflow in the exp function, but > is not > necessary for 0/1 covariates.�� The fact that the default survival curve > would be for a > sex of .453, say, was off-putting to many. > > Terry T. > > On 9/3/21 11:01 AM, Bond, Stephen wrote: > > Hi, > > Please, help me understand what is happening with the means of a Cox > model? > > I have: > > R version 4.0.2 (2020-06-22) -- "Taking Off Again" > > Copyright (C) 2020 The R Foundation for Statistical Computing > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > getOption("contrasts") > > ��� unordered ordered > > "contr.treatment" "contr.poly" > > According to the help �coxph.object has a component holding the means of > the X > (model.matrix). This does not hold any more. > > ``` > > library(survival) > > test1 <- list(time=c(4,3,1,1,2,2,3), > > ���status=c(1,1,1,0,1,1,0), > > ���x=c(0,2,1,1,1,0,0), > > ���sex=factor(c(0,0,0,0,1,1,1))) > > m1 <- coxph(Surv(time, status) ~ x + sex, test1) > > m1$means > > ##��� x� sex1 > > ## 0.7142857 0.000 > > colMeans(model.matrix(m1)) > > ## x� sex1 > > ## 0.7142857 0.4285714 > > ``` > > Will new observations be scored using the zero mean from the object?? Is > this just a > reporting change where $means shows the reference level and no longer the > mean of > the model matrix?? > > Thanks everybody > > ATTENTION : This email originated outside your organization. Exercise caution > before > clicking links, opening attachments, or responding with personal information. > > -- [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxph means not equal to means of model matrix
See ?coxph, in particular the new "nocenter" option. Basically, the "mean" component is used to center later computations.� This can be critical for continuous variables, avoiding overflow in the exp function, but is not necessary for 0/1 covariates.�� The fact that the default survival curve would be for a sex of .453, say, was off-putting to many. Terry T. On 9/3/21 11:01 AM, Bond, Stephen wrote: > > Hi, > > Please, help me understand what is happening with the means of a Cox model? > > I have: > > R version 4.0.2 (2020-06-22) -- "Taking Off Again" > > Copyright (C) 2020 The R Foundation for Statistical Computing > > Platform: x86_64-w64-mingw32/x64 (64-bit) > > getOption("contrasts") > > ��� unordered�� ordered > > "contr.treatment"� "contr.poly" > > According to the help �coxph.object has a component holding the means of the > X > (model.matrix). This does not hold any more. > > ``` > > library(survival) > > test1 <- list(time=c(4,3,1,1,2,2,3), > > ���status=c(1,1,1,0,1,1,0), > > ���x=c(0,2,1,1,1,0,0), > > ���sex=factor(c(0,0,0,0,1,1,1))) > > m1 <- coxph(Surv(time, status) ~ x + sex, test1) > > m1$means > > ##��� x� sex1 > > ## 0.7142857 0.000 > > colMeans(model.matrix(m1)) > > ## x� sex1 > > ## 0.7142857 0.4285714 > > ``` > > Will new observations be scored using the zero mean from the object?? Is this > just a > reporting change where $means shows the reference level and no longer the > mean of the > model matrix?? > > Thanks everybody > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] syvcoxph and cox.zph for testing the PH assumption
On 7/11/21 5:00 AM, r-help-requ...@r-project.org wrote: Hello, is it kosher to call cox.zph on a syvcoxph model fit? I see that someone proposed a modified version of cox.zph that uses resid(fit, 'schoenfeld', **weighted=TRUE**). https://stats.stackexchange.com/questions/265307/assessing-proportional-hazards-assumption-of-a-cox-model-with-caseweights Is that all it takes? Thanks, Youyi The cox.zph function does a formal score test. No, it does not account for robust variance. I hadn't considered that case, but will now think about it. It is quite easy to show that there is a problem: just give everyone a weight of 100. The stackexchange conversation was new to me. The solution there won't work with the current code, which does not make use of resid(). It has been updated to do the proper score test, the older version of cox.zph, which they modified, used an approximation. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] inverse of the methods function
Is there a complement to the methods function, that will list all the defined methods for a class? One solution is to look directly at the NAMESPACE file, for the package that defines it, and parse out the entries. I was looking for something built-in, i.e., easier. -- Terry M Therneau, PhD Department of Health Science Research Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] evil attributes
I wrote: "I confess to being puzzled WHY the R core has decided on this definition..." After just a little more thought let me answer my own question. a. The as.vector() function is designed to strip off everything extraneous and leave just the core. (I have a mental image of Jack Webb saying "Just the facts ma'am"). I myself use it freqently in the test suite for survival, in cases where I'm checking the corrent numeric result and don't care about any attached names. b. is.vector(x) essentially answers the question "does x look like a result of as.vector?" Nevertheless I understand Roger's confusion. -- Terry M Therneau, PhD Department of Quantitative Health Sciences Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] "exact" p-values
I am late to this discussion -- I read R-help as a once-a-day summary. A few comments. 1. In the gene-discovery subfield of statistics (SNP studies, etc.) there is a huge multiple-testing problem. In defense, the field thinks in terms of thresholds like 1e-5 or 1e-10 rather than the .05 or .01 most of us are used to. In that literature, they do care about 1e-16 vs 1e-20. We can all argue about whether that is a sensible approach or not, but it is what it is. I think that this is the context of the journal's request, i.e., they want the actual number, however you calculate it. My own opinion is that these rarified p-values are an arbitrary scale, one that no longer has a probability interpretation. For the central limit theorem to be correct that far from the mean requires a sample size that is beyond imagination (`number of atoms in the earth' order of size). Such a scale may still be useful, but it's not really a probability. 2. The label of "Fisher's exact test" has caused decades of confusion. In this context the word means "a particular test whose distribution can be completely enumerated": it does not mean either "correct" or "precise". The original enumeration methods had limitations with resspect to the sample size or the presence of complications such as tied values; from the discussion so far it would appear that the 'exact' argument of wilcox.test uses such a method. Cyrus Mehta did nice work on improved algorithms that do not have these restrictions, methods that have been refiined and expanded in the software offerings from Cytel among others. Perhaps someone could update R's code to use this, but see 3 below. My own opinion is that permutation tests are an important tool, one "wrench" in our statistical toolbox. But they are only one tool out of many. I am quite put off by arguments that purposefully conflate "exact" and "correct". 3. The concordance statistic C, the Wilcoxon test, and Somer's d are all the same statistic, just written a little differently. (Somer's d is essentially Kendalls' tau, but with a slightly different rule for ties). A test for C=.5 is the same as a Wilcoxon. For a binary response C = the area under the reciever operating curve (AUC). The concordance command in the surivival library computes this statistic for continuous, binary, or censored responses. The variance is based on a jackknife argument, and is computed by organizing the data into a binary tree structure, very similar to the methods used by Mehta, is efficient for large n and is valid for ties. Perhaps add a link in the wilcox.test help page? Footnote: AUC is a special case of C but not vice versa. People sometimes try to extend AUC to the other data types, but IMHO with only moderate success. -- Terry M Therneau, PhD Department of Health Science Research Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] temporary clipping
In one of my plot functions it is convenient to use clipping to restrict the range of some output. But at the end of the function I'd like to turn it off, i.e., so that a subsequent use of legend(), text() or whatever is not affected. I don't quite see how to do this -- it seems that the only way to turn off clip is with a new plot. -- Terry M Therneau, PhD Department of Health Science Research Mayo Clinic thern...@mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] "chi-square" | "chi-squared" | "chi squared" | "chi, square"
Martin, A fun question. Looking back at my oldest books, Feller (1950) used chi-square. Then I walked down the hall to our little statistics library and looked at Johnson and Kotz, "Continous Univariate Distributions", since each chapter therein has comments about the history of the distribution. a. They use 'chi-square' throughout their history section, tracing the distribution back to work in the 1800s. But, those earliest papers apparently didn't name their results as chi- whatever, so an "origin" story didn't pan out. b. They have 13 pages of references, and for fun I counted the occurence of variants. The majority of papers don't have the word in the title at all and the next most common is the Greek symbol. Here are the years of the others: chi-square: 73 43 65 80 86 73 82 73 69 69 78 64 64 86 65 86 82 82 76 82 88 81 74 77 87 86 93 69 60 88 88 80 77 41 59 79 31 chi-squared: 72 76 82 83 89 79 69 67 77 78 69 77 83 88 87 89 78 chi: 92 73 89 87 chi-squares: 77 83 chi-bar-square: 91 There doesn't look to be a trend over time. The 1922 Fisher reference uses the Greek symbol, by the way. Terry T [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] conflicting results for a time-varying coefficient in a Cox model
This is an excellent question. The answer, in this particular case, mostly has to do with the outlier time values. (I've never been convinced that the death at time 999 isn't really a misplaced code for "missing", actually). If you change the knots used by the spline you can get quite different values. For instance, using a smaller data set: fit1 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno, veteran) zph1 <- cox.zph(fit1, transform='identity') plot(zph1[3]) dtime <- unique(veteran$time[veteran$status ==1]) # all of the death times veteran2 <- survSplit( Surv(time, status) ~ ., data=veteran, cut=dtime) fit2 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno + karno:ns(time, df=4), data=veteran2) tx <- 0:100 * 10 # x positions for plot ncall <- attr(terms(fit2), "predvars")[[6]] ty <- eval(ncall, data.frame(time = tx)) %*% coef(fit2)[4:7] + coef(fit2)[3] lines(tx, ty, col=2) - Now it looks even worse! The only difference is that the ns() function has chosen a different set of knots. The test used by the cox.zph function is based on a score test and is solid. The plot that it produces uses a smoothed approximation to the variance matrix and is approximate. So the diagnostic plot will never exactly match an actual fit. In this data set the outliers exacerbate the issue. To see this try a different time scale. zph2 <- cox.zph(fit1, transform= sqrt) plot(zph2[3]) veteran2$stime <- sqrt(veteran2$time) fit3 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno + karno:ns(stime, df=4), data=veteran2) ncall3 <-attr(terms(fit3), "predvars")[[6]] ty3 <- eval(ncall3, data.frame(stime= sqrt(tx))) %*% coef(fit3)[4:7] + coef(fit3)[3] lines(sqrt(tx), ty3, col=2) The right tail is now better behaved. Eliminating the points >900 makes things even better behaved. Terry T. On 8/8/19 9:07 AM, Ferenci Tamas wrote: > I was thinking of two possible ways to > plot a time-varying coefficient in a Cox model. > > One is simply to use survival::plot.cox.zph which directly produces a > beta(t) vs t diagram. > > The other is to transform the dataset to counting process format and > manually include an interaction with time, expanded with spline (to be > similar to plot.cox.zph). Plotting the coefficient produces the needed > beta(t) vs t diagram. > > I understand that they're slightly different approaches, so I don't > expect totally identical results, but nevertheless, they approximate > the very same thing, so I do expect that the results are more or less > similar. > > However: > > library( survival ) > library( splines ) > > data( veteran ) > > zp <- cox.zph( coxph(Surv(time, status) ~ trt + prior + karno, > data = veteran ), transform = "identity" )[ 3 ] > > veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno, > data = veteran, cut = 1:max(veteran$time) ) > > fit <- coxph(Surv(tstart,time, status) ~ trt + prior + karno + > karno:ns( time, df = 4 ), data = veteran3 ) > cf <- coef( fit ) > nsvet <- ns( veteran3$time, df = 4 ) > > plot( zp ) > lines( 0:1000, ns( 0:1000, df = 4, knots = attr( nsvet, "knots" ), > Boundary.knots = attr( nsvet, "Boundary.knots" ) )%*%cf[ > grep( "karno:ns", names( cf ) ) ] + cf["karno"], > type = "l", col = "red" ) > > Where is the mistake? Something must be going on here, because the > plots are vastly different... > > Thank you very much in advance, > Tamas [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Trying to understand the magic of lm (Still trying)
John, The text below is cut out of a "how to write a package" course I gave at the R conference in Vanderbilt. I need to find a home for the course notes, because it had a lot of tidbits that are not well explained in the R documentation. Terry T. Model frames: One of the first tasks of any modeling routine is to construct a special data frame containing the covariates that will be used, via a call to the model.frame function. The code to do this is found in many routines, and can be a little opaque on first view. The obvious code would be \begin{verbatim} coxph <- function(formula, data, weights, subset, na.action, init, control, ties= c("efron", "breslow", "exact"), singular.ok =TRUE, robust=FALSE, model=FALSE, x=FALSE, y=TRUE, tt, method=ties, ...) { mf <- model.frame(formula, data, subset, weights, na.action) \end{verbatim} since those are the coxph arguments that are passed forward to the model.frame routine. However, this simple approach will fail with a ``not found'' error message if any of the data, subset, weights, etc. arguments are missing. Programs have to take the slightly more complicated approach of constructing a call. \begin{verbatim} Call <- match.call() indx <- match(c("formula", "data", "weights", "subset", "na.action"), names(Call), nomatch=0) if (indx[1] ==0) stop("A formula argument is required") temp <- Call[c(1,indx)] # only keep the arguments we wanted temp[[1]] <- as.name('model.frame') # change the function called mf <- eval(temp, parent.frame()) Y <- model.response(mf) etc. \end{verbatim} We start with a copy of the call to the program, which we want to save anyway as documentation in the output object. Then subscripting is used to extract only the portions of the call that we want, saving the result in a temporary. This is based on the fact that a call object can be viewed as a list whose first element is the name of the function to call, followed by the arguments to the call. Note the use of \code{nomatch=0}; if any arguments on the list are missing they will then be missing in \code{temp}, without generating an error message. The \mycode{temp} variable will contain a object of type ``call'', which is an unevaluated call to a routine. Finally, the name of the function to be called is changed from ``coxph'' to ``model.frame'' and the call is evaluated. In many of the core routines the result is stored in a variable ``m''. This is a horribly short and non-descriptive name. (The above used mf which isn't a much better.) Many routines also use ``m'' for the temporary variable leading to \code{m <- eval(m, parent.frame())}, but I think that is unnecessarily confusing. The list of names in the match call will include all arguments that should be evaluated within context of the named dataframe. This can include more than the list above, the survfit routine for instance has an optional argument ``id'' that names an identifying variable (several rows of the data may represent a single subject), and this is included along with ``formula'' etc in the list of choices in the match function. The order of names in the list makes no difference. The id is later retrieved with \code{model.extract(m, 'id')}, which will be NULL if the argument was not supplied. At the time that coxph was written I had not caught on to this fact and thought that all variables that came from a data frame had to be represented in the formula somehow, thus the use of \code{cluster(id)} as part of the formula, in order to denote a grouping variable. On 5/11/19 5:00 AM, r-help-requ...@r-project.org wrote: > A number of people have helped me in my mission to understand how lm (and > other fucntions) are able to pass a dataframe and then refer to a specific > column in the dataframe. I thank everyone who has responded. I now know a bit > about deparse(substitute(xx)), but I still don't fully understand how it > works. The program below attempts to print a column of a dataframe from a > function whose parameters include the dataframe (df) and the column requested > (col). The program works fine until the last print statement were I receive > an error, Error in `[.data.frame`(df, , col) : object 'y' not found . I hope > someone can explain to me (1) why my code does not work, and (2) what I can > do to fix it. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] adding a hex sticker to a package
I've created a hex sticker for survival. How should that be added to the package directory? It's temporarily in man/figures on the github page. Terry T. (Actually, the idea was from Ryan Lennon. I liked it, and we found someone with actual graphical skills to execute it. ) [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] weighed Fleming-Harrington log rank test
You are correct that the survreg routine only supports 'rho' of the Fleming-Harrington G-rho tests. This is a function of age -- I wrote the original code back when I was working with Tom (Fleming), and he was only using 1 parameter. Later he and David expanded the test to two parameters. This might be only the second request for the feature in the 30+ years since that date. Terry Therneau [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Time-dependent coefficients in a Cox model with categorical variants
First, as others have said please obey the mailing list rules and turn of First, as others have said please obey the mailing list rules and turn off html, not everyone uses an html email client. Here is your code, formatted and with line numbers added. I also fixed one error: "y" should be "status". 1. fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0) 2. p <- log(predict(fit0, newdata = data1, type = "expected")) 3. lp <- predict(fit0, newdata = data1, type = "lp") 4. logbase <- p - lp 5. fit1 <- glm(status ~ offset(p), family = poisson, data = data1) 6. fit2 <- glm(status~ lp + offset(logbase), family = poisson, data = data1) 7. group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) 8. fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1) The key idea of the paper you referenced is that the counterpart to the Hosmer-Lemishow test (wrong if used directly in a Cox model) is to look at the predicted values from a Cox model as input to a Poisson regression. That means adding the expected from the Cox model as a fixed term in the Poisson. And like any other poisson that means offset(log(expected)) as a term. The presence of time dependent covariates does nothing to change this, per se, since expected for time fixed is the same as for time varying. In practice it does matter, at least philosophically. Lines 1, 2, 5 do this just fine. If data1 is not the same as data0, a new study say, then the test for intercept=0 from fit1 is a test of overall calibration. Models like line 8 try to partition out where any differences actually lie. The time-dependent covariates part lies in the fact that a single subject may be represented by multiple lines in data0 and/or data1. Do you want to collapse that person into a single row before the glm fits? If subject "Jones" is represented by 15 lines in the data and "Smith" by 2, it does seem a bit unfair to give Jones 15 observations in the glm fit. But full discussion of this is as much philosophy as statistics, and is perhaps best done over a beer. Terry T. From: Max Shell [archerr...@gmail.com] Sent: Wednesday, January 17, 2018 10:25 AM To: Therneau, Terry M., Ph.D. Subject: Re: Time-dependent coefficients in a Cox model with categorical variants Assessing calibration of Cox model with time-dependent coefficients<https://stats.stackexchange.com/questions/323569/assessing-calibration-of-cox-model-with-time-dependent-coefficients> I am trying to find methods for testing and visualizing calibration to Cox models with time-depended coefficients. I have read your nice article<http://journals.sagepub.com/doi/10.1177/0962280213497434>. In this paper, we can fit three models: fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0) p <- log(predict(fit0, newdata = data1, type = "expected")) lp <- predict(fit0, newdata = data1, type = "lp") logbase <- p - lp fit1 <- glm(y ~ offset(p), family = poisson, data = data1) fit2 <- glm(y ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(y ~ -1 + group + offset(p), family = poisson, data = data1) Here$B!$(BI simplely use data1$B!!(B<- data0[1:500,] First, I get following error when running line 5. Error in eval(predvars, data, env) : object 'y' not found So I modifited the code by replacing the y as status looks like this: fit1 <- glm(status ~ offset(p), family = poisson, data = data1) fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1) Is this replacing correct? Second, I try to introduce the time-transform use coxph with ttparament. My code is: fit0 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + tt(x3), data = data0, function(x, t, ...) x * t) p <- log(predict(fit0, newdata = data1, type = "expected")) lp <- predict(fit0, newdata = data1, type = "lp") logbase <- p - lp fit1 <- glm(status ~ offset(p), family = poisson, data = data1) fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1) group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf)) fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1) My questions is: * Is the code above correct? * How to interpret the fit1, fit2, fit3? What's the connection between the three models and the calibration of the Cox model? * How to generate the calibration plot using fit3? The article dose have a section discuss this, but no code is provided. Thank you! On Mon, Jan 15, 2018 at 9:23 PM, Therneau, Terry M., Ph.D. <thern...@mayo.edu<mailto:thern...@mayo.edu>&g
[R] help matching rows of a data frame
This question likely has a 1 line answer, I'm just not seeing it. (2, 3, or 10 lines is fine too.) For a vector I can do group <- match(x, unqiue(x)) to get a vector that labels each element of x. What is an equivalent if x is a data frame? The result does not have to be fast: the data set will have < 100 elements. Since this is inside the survival package, and that package is on the 'recommended' list, I can't depend on any package outside the recommended list. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Odd results from rpart classification tree
You are mixing up two of the steps in rpart. 1: how to find the best candidate split and 2: evaluation of that split. With the "class" method we use the information or Gini criteria for step 1. The code finds a worthwhile candidate split at 0.5 using exactly the calculations you outline. For step 2 the criteria is the "decision theory" loss. In your data the estimated rate is 0 for the left node and 15/45 = .333 for the right node. As a decision rule both predict y=0 (since both are < 1/2). The split predicts 0 on the left and 0 on the right, so does nothing. The CART book (Brieman, Freidman, Olshen and Stone) on which rpart is based highlights the difference between odds-regression (for which the final prediction is a percent, and error is Gini) and classification. For the former treat y as continuous. Terry T. On 05/15/2017 05:00 AM, r-help-requ...@r-project.org wrote: The following code produces a tree with only a root. However, clearly the tree with a split at x=0.5 is better. rpart doesn't seem to want to produce it. Running the following produces a tree with only root. y <- c(rep(0,65),rep(1,15),rep(0,20)) x <- c(rep(0,70),rep(1,30)) f <- rpart(y ~ x, method='class', minsplit=1, cp=0.0001, parms=list(split='gini')) Computing the improvement for a split at x=0.5 manually: obs_L <- y[x<.5] obs_R <- y[x>.5] n_L <- sum(x<.5) n_R <- sum(x>.5) gini <- function(p) {sum(p*(1-p))} impurity_root <- gini(prop.table(table(y))) impurity_L <- gini(prop.table(table(obs_L))) impurity_R <- gini(prop.table(table(obs_R))) impurity <- impurity_root * n - (n_L*impurity_L + n_R*impurity_R) # 2.880952 Thus, an improvement of 2.88 should result in a split. It does not. Why? Jonathan __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] finegray function in the survival package
Your question is much like "what is in my pocket?" since the only possible answer is "how could I know?". You have some block of R code that produce the stated error, but we don't know what it is. One wild guess is that a death time which is <=0 will produce this error since that would lead to an invalid time interval. Please look at the "Asking for help" section of https://www.r-project.org/help.html, and in particular the link on reproducable examples. You need to give an example, with data, that reproduces the problem. Terry T. (PS, I agree that the error message is too terse. I will try to make it better.) On 05/03/2017 05:00 AM, r-help-requ...@r-project.org wrote: Hello R-team, I am trying to use the tmerge function from survvial library. My data is similar to mgus2 dataset from R. But, I get the following message. Error in tmerge(dat, dat, id = Number, death = event(dYears, death), BMF = event(ptemp, : tstart must be > tstop Could you help me? Thanks, Ahalya. On Wed, Sep 7, 2016 at 8:29 AM, Therneau, Terry M., Ph.D. <thern...@mayo.edu wrote: On 09/07/2016 05:00 AM, r-help-requ...@r-project.org wrote: Dear R-Team, I have been trying to use the finegray routine that creates a special data so that Fine and Gray model can be fit. However, it does not seem to work. Could you please help me with this issue? Thanks, Ahalya. You have given us no details of your example code that "doesn't work", and I can't read your mind. So no, we can't help. Give us a hint. Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] survival package can't find Ccoxfit6
On 04/27/2017 09:53 AM, sesh...@mskcc.org wrote: Thank you Drs. Therneau and Murdoch. "Why not use coxph.fit?" -- My use case scenario is that I needed the Cox model coefficients for resampled data. I was trying to reduce the computational overhead of coxph.fit (since it will repeated a large number of times) by stripping all the parts that I don't need such as sorting of the data prior to coxfit6.c call and Martingale residual and concordance computations after the parameters are estimated. That is an interesting use case which I had not thought about. The first question is just how much slower coxph.fit is than the stripped down version (I'd guess about 1/2 but that is just a guess), and whether that makes a real difference to your code. If it is spending 10% of its time in the coxph calculation a change to 5% isn't that much, but 90% is something else. The next is what is the main impediment (I'd guess concordance, but again just a guess.) Perhaps I could add concordance= and/or resid= flags to the fitting routine. Under the R v3.4.0 model one cannot create any modified form of coxph.fit and expect it to work. Worse yet is the following where I copy "coxph.fit" to my workspace as "mycoxph.fit" (works initially because the environment is namespace:survival and fails when environment changed to R_GlobalEnv) If you were under linux another solution would be to grab the source from github, add your routine to the R/ directory, then R CMD build followed by R CMD INSTALL. Macintosh is essentially as easy, though you need to install Xcode for the compilers. The compile toolchain for windows is outside my ken. Let's keep talking. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] survival package can't find Ccoxfit6
Let me summarize rather than repeat the entire thread: An error report from a user (seshan) stumped me, and I asked for help here. Duncan Murdoch picked up on fine details of the error message, i.e., that the error did NOT come from within the survival package. That changes the whole tenor of the discussion. Indeed, the user has their own function "phcoefs" that directly calls one of my internal C routines. As of R 3.4, this can only be done for routines that I explicitly export. I don't export coxfit6.c. Where to go from here? 1. I'm not against exporting a routine, but I'm not going to do it without a discussion. Doing so is more work for me: I'd need to write a test routine in order to ensure long-term reliability of the export, and it ties my hands wrt future changes. In this partiuclar case, why not use coxph.fit? 2. One of the design goals for the survival package is to make it usable as a component for other's work. For instance all of the return structures are easily inspected (no S4 classes) and most are carefully documented e.g. help(coxph.object). The core computations of coxph are split out into separate functions coxph.fit and agreg.fit, so that they can be called directly without the formula and argument checking overhead. Ditto for survreg, survifit, survdiff and concordance. Given the number of other packages that depend on survival I have been at least moderately successful at this aim. (To be honest this is not entirely alturism on my part as it stops the near infinite requests to add one 'just one more thing' to the package.) This also means I am open to modifying a routine or exporting a call -- if you can make a good argument. 3. I need a good way to document this for the survival package. Yet one more chapter in my 1/2 written book. Someday... 4. Calling another package's C routines is dangerous, and one of the goals of the 3.4 namespace changes was to stop this from happening willy-nilly. The new error messages look like success. Though it means that I'm getting multiple "not found" emails. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] survival package can't find Ccoxfit6
A user contacted me directly about this, I answered with my best understanding of the recent R-help discussion of the issue, and their response to my response shows that I'm not quite right. I am emphatically not an MS Windows user so am asking for help -- which I will cut/paste to this user and to the next dozen who will invariably contact me directly. Thanks, Terry Therneau Forwarded Message Subject: RE: survival package Date: Wed, 26 Apr 2017 18:05:30 + From: sesh...@mskcc.org To: Therneau, Terry M., Ph.D. <thern...@mayo.edu> Thank you for the quick response. The session info command for v3.4.0 does in fact report survival_2.41-3. Furthermore, while both v3.3.1 and v3.40 are on the same computer the library paths do not have any directory in common: .libPaths() [1] "C:/Program Files/R/R-3.4.0/library" and .libPaths() [1] "C:/Program Files/R/R-3.3.1/library" Thanks, Venkat -Original Message----- From: Therneau, Terry M., Ph.D. [mailto:thern...@mayo.edu] Sent: Wednesday, April 26, 2017 1:42 PM To: Seshan, Venkatraman E./Epidemiology-Biostatistics Subject: Re: survival package This has been discussed in R-help by multiple people. You have a pre-3.4 version of the survival package somewhere on your search path, and the method for resolving .C calls has changed. The sessionInfo command should report survival version 2.41-3. Terry T. On 04/26/2017 12:17 PM, sesh...@mskcc.org wrote: Dear Prof. Therneau, I am encountering an error message when I try to use the coxfit6 routine from the survival package under the 3.4.0 version of R. The minimal function and the script are in the attached file. This function worked under earlier versions of R. -- - *** ** Works under R-3.3.1 ** *** source("coxfit6-issue.R") [1] -0.4838181 sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] survival_2.39-4 loaded via a namespace (and not attached): [1] Matrix_1.2-6splines_3.3.1 grid_3.3.1 lattice_0.20-33 -- - *** ** Does not work under R-3.4.0 ** *** library(survival) source("coxfit6-issue.R") Error in .Call("Ccoxfit6", as.integer(control$iter.max), stime, as.integer(sstat), : "Ccoxfit6" not available for .Call() for package "survival" sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] survival_2.41-3 loaded via a namespace (and not attached): [1] compiler_3.4.0 Matrix_1.2-9splines_3.4.0 grid_3.4.0 [5] lattice_0.20-35 -- - When I remove the quotes surrounding Ccoxfit6 in the function both versions give the error: Error in phcoefs(stim[ii], sts[ii], as.matrix(as.double(cvt[ii])), oo$coefficients, : object 'Ccoxfit6' not found I would greatly appreciate your help in resolving this. Thanks, Venkat Seshan = Please note that this e-mail and any files transmitted from Memorial Sloan Kettering Cancer Center may be privileged, confidential, and protected from disclosure under applicable law. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any reading, dissemination, distribution, copying, or other use of this communication or any of its attachments is strictly prohibited. If you have received this communication in error, please notify the sender immediately by replying to this message and deleting this message, any attachments, and all copies and backups from your computer. __ R-help@r-project.org mailing list -- To UNSUBSCRI
[R] finding out if a method exists
I'm thinking of adding a new "cmatrix" function/method to the survival package but before I do I'd like to find out if any other packages already use this function name. The obvious method is to look at the NAMESPACE file for each package in CRAN and read the export list. This is the kind of task for which someone, somewhere will have written routines. I just don't know who or where. Any hints? Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R: How to prune using holdout data
Let me give an outline of how to answer Alfredo's question via an example. I will split the data set "lung" into two peices. For these subjects with advanced lung cancer the physician's assessment of ECOG performance status (ph.ecog) is one of the most powerful indicators of outcome. Try to predict it from other variables. library(survival) # for the test data set library(rpart) data1 <- lung[1:125,] data2 <- lung[126:228,] rfit1 <- rpart(ph.ecog ~ ., data=data1) printcp(rfit1) CP nsplit rel error xerror xstd 1 0.565788 0 1.0 1.04037 0.100516 2 0.098516 1 0.43421 0.44906 0.045001 3 0.042708 2 0.33570 0.35134 0.041692 4 0.031032 3 0.29299 0.37610 0.042971 5 0.019949 4 0.26196 0.37753 0.044692 6 0.01 5 0.24201 0.39166 0.050332 # Validate using data2. First get predictions for each of the pruned trees cpvalues <- rfit1$cptable[,1] pmat <- matrix(0, nrow(data2), length(cpvalues)) for (i in 1:length(cpvalues)) pmat[,i] <- predict(prune(rfit1, cpvalues[i]), newdata=data2) Now, we need to decide what on a measure of error. Try simple squared error. error <- colMeans((data2$ph.ecog - pmat)^2) round(error, 3) [1] 0.493 0.280 0.210 0.225 0.186 0.198 This is simple, but other cases are more complex. The performace score is actually an integer from 0-4 (5= dead), see http://ecog-acrin.org/resources/ecog-performance-status table(lung$ph.ecog) 0 1 2 3 63 113 50 1 Suppose instead we fit a model and treat the response as categorical? The total number of nested models is a bit smaller. rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class") printcp(rfit2) CP nsplit rel error xerror xstd 1 0.35938 0 1.0 1.0 0.086951 2 0.12500 1 0.64062 0.64062 0.081854 3 0.06250 2 0.51562 0.70312 0.083662 4 0.03125 4 0.39062 0.57812 0.079610 5 0.01000 5 0.35938 0.56250 0.078977 predict(rfit2, newdata=data2)[1:5,] 0 1 2 3 126 0.03125 0.9375 0.03125 0 127 0.03125 0.9375 0.03125 0 128 0.03125 0.9375 0.03125 0 129 0.03125 0.9375 0.03125 0 130 0.37500 0.6250 0.0 0 Now, we can ask for predicted probabilities for each class (default), which is a vector of length 4 for each subject, or for the predicted class, which is a single value. Which do we want, and then what is the best measure of prediction error? If three subjects with value 0 had prediction class vectors of (.8, .2, 0, 0), (.8, .1, .1, 0) and (.45, .25, .2, .1), one outlook would say they all are the same (all pick 0 as the best), others would give them different errors. Is the second prediction worse than the first? What if the single subject with ph.ecog=3 had ended up in the validation data set; how should we judge their prediction? This complexity is one reason that there is not a simple function for "validation" with a new data set. On 02/27/2017 09:48 AM, Alfredo wrote: Thank you, Terry, for your answer. I’ll try to explain better my question. When you create a classification or regression tree you first grow a tree based on a splitting criteria: this usually results in a large tree that provides a good fit to the training data. The problem with this tree is its potential for overfitting the data: the tree can be tailored too specifically to the training data and not generalize well to new data. The solution (apart cross-validation) is to find a smaller subtree that results in a low error rate on *holdout or validation data.* Hope it helps to clarity my question. Best, Alfredo -Messaggio originale- Da: Therneau, Terry M., Ph.D. [mailto:thern...@mayo.edu] You will need to give more detail of exactly what you mean by "prune using a validation set". THe prune.rpart function will prune at any value you want, what I suspect you are looking for is to compute the error of each possible tree, using a validation data set, then find the best one, and then prune there. How do you define "best"? __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to prune using a validation set
You will need to give more detail of exactly what you mean by "prune using a validation set". THe prune.rpart function will prune at any value you want, what I suspect you are looking for is to compute the error of each possible tree, using a validation data set, then find the best one, and then prune there. How do you define "best"? Terry Therneau On 02/26/2017 05:00 AM, r-help-requ...@r-project.org wrote: I'd like to use a different data ( validation) set for pruning my classification tree. Unfortunately there aren't arguments to get this in prune.rpart(). Any suggestions? Thanks! Alfredo __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Can mice() handle crr()? Fine-Gray model
Look at the finegray command within the survival package; the competing risks vignette has coverage of it. The command creates an expanded data set with case weights, such that coxph() on the new data set = the Fine Gray model for the original data. Anything that works with coxph is valid on the new data. Caveat -- I don't know mice() well at all: the dat1 data set below has multiple observations per subject, should the mice() command be cognizant of this? Terry Therneau library(survival) library(mice) test1 <- data.frame (time=c(4,3,1,1,2,2,3,5,2,4,5,1, 4,3,1,1,2,2,3,5,2,4,5,1), status=c(1,1,1,0,2,2,0,0,1,1,2,0, 1,1,1,0,2,2,0,0,1,1,2,0), x=c(0,2,1,1,NA,NA,0,1,1,2,0,1, 0,2,1,1,NA,NA,0,1,1,2,0,1), sex=c(0,0,0,NA,1,1,1,1,NA,1,0,0, 0,0,0,NA,1,1,1,1,NA,1,0,0))) # Endpoint 1, simple, then with mice fit0 <- copxh(Surv(time, status==1) ~ sex + x, test1) dat0 <- mice(test1, m=10) mfit0 <- with(dat0, coxph(Surv(time, status==1) ~ sex + x)) summary(pool(mfit0)) # Endpoint 1, Fine-Gray model fg1 <- finegray(Surv(time, factor(status)) ~ ., data=test1, etype=1) fit1 <- coxph(Surv(fgstart, fgstop, fgstatus) ~ sex + x, data=fg1, weight= fgwt) dat1 <- mice(fg1, m=10) mfit1 <- with(dat1, coxph(Surv(fgstart, fgstop, fgstatus) ~ sex + x, weight=fgwt)) On 01/23/2017 05:00 AM, r-help-requ...@r-project.org wrote: Here is an example: # example library(survival) library(mice) library(cmprsk) test1 <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1, 4,3,1,1,2,2,3,5,2,4,5,1), status=c(1,1,1,0,2,2,0,0,1,1,2,0, 1,1,1,0,2,2,0,0,1,1,2,0), x=c(0,2,1,1,NA,NA,0,1,1,2,0,1, 0,2,1,1,NA,NA,0,1,1,2,0,1), sex=c(0,0,0,NA,1,1,1,1,NA,1,0,0, 0,0,0,NA,1,1,1,1,NA,1,0,0))) dat <- mice(test1,m=10) #Cox regression: cause 1 models.cox1 <- with(dat,coxph(Surv(time, status==1) ~ x +sex )) summary(pool(models.cox1)) #Cox regression: cause 1 or 2 models.cox <- with(dat,coxph(Surv(time, status==1 | status==2) ~ x +sex )) models.cox summary(pool(models.cox)) # crr() #Fine-Gray model models.FG<- with(dat,crr(ftime=time, fstatus=status, cov1=test1[,c( "x","sex")], failcode=1, cencode=0, variance=TRUE)) summary(pool(models.FG)) #Error in pool(models.FG) : Object has no vcov() method. models.FG -- Andreu Ferrero Gregori __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Hyperbola formula
This simple form of a hyperbola is not well known. I find it useful for change point models: since the derivative is continuous it often behaves better in a maximizer. h1 <- function(x, b, k=3) .5 * b * (x + sqrt(x^2 + k^2)) Function h1() has asymptotes of y=0 to the left of 0 and y=x to the right of 0. The parameter k controls how tightly the curve bends at zero. A generalization is h2 <- function (x, b, k=3) { z <- x - b[4] b[1] + b[2]*z + .5*b[3]* (z + sqrt(z^2 + k^2)) } b[1] and b[2] are the intercept and slope of the left portion of the curve, b[4] is the inflection point, and b[3] is the change in slope at the inflection point. The main downside is that k is arbitrary. I simply choose it to "look right", though it too could be part of an optimization. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Parallel package guidance needed
I have a process that I need to parallelize, and have a question about two different ways to proceed. It is essentially an MCMC exploration where the likelihood is a sum over subjects (6000 of them), and the per-subject computation is the slow part. Here is a rough schematic of the code using one approach: mymc <- function(formula, data, subset, na.action, id, etc) { # lots of setup, long but computationally quick hlog <- function(thisid, param) { # compute the loglik for this subject ... } uid <- unique(id) # multiple data rows for each subject for (i in 1:burnin) { param <- get_next_proposal() loglist <- mclapply(uid, hlog, param=param) loglik <- sum(unlist(loglist)) # process result } # Now the non-burnin MCMC iterations � } The second approach is to put cluster formation outside the loop, e.g., ... clust <- makeForkCluster() for (i in 1:burnin) { param <- get_next_proposal() loglist <- parLapply(clust, uid, hlog, param=param) loglik <- sum(unlist(loglist)) # process result } # rest of the code stopCluster(clust) -- On the face of it, the second looks like it "could" be more efficient since it only starts and stops the subprocesses once. A short trial on one of our cluster servers seems to say the opposite. The load average on a quiet machine never gets much over 5-6 using method 2, and in the 20s for method 1 (detectCores() =80 on the box, we used mc.cores=50). Wall time for method 2 is looking to be several hours. Any pointers to documentation/discussion at this level would be much appreciated. I'm going to be fitting a lot of models. Terry T. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] which parallel routine
I'm looking for advice on which of the parallel systems to use. Context: maximize a likelihood, each evaluation is a sum over a large number of subjects (>5000) and each of those per subject terms is slow and complex. If I were using optim the context would be fit <- optim(initial.values, myfun, �.) myfun <- function(params) { � Do some initial setup� temp <- apply-in-parallel(id, per-subject-eval-fun, p=params) unlist(temp) } The use of mcapply seems like there would be a lot of overhead starting and stopping threads. But none of the tutorials I've found addresses this particular question. Both direct answers and pointers to other docs would be welcome. Terry T. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] independent censoring
On 11/29/2016 05:00 AM, r-help-requ...@r-project.org wrote: Independent censoring is one of the fundamental assumptions in the survival analysis. However, I cannot find any test for it or any paper which discusses how real that assumption is. I would be grateful if anybody could point me to some useful references. I have found the following paper as an interesting reference but it is not freely available. Leung, Kwan-Moon, Robert M. Elashoff, and Abdelmonem A. Afifi. "Censoring issues in survival analysis." Annual review of public health 18.1 (1997): 83-104. This is because there is no test for independent censoring. Say I am following a cohort of older gentlemen (65 years old) who were diagnoses with condition "x", and after 8 years there a a dozen who no longer answer my letters. Why not? a. Perhaps because they are in a nursing home, with dementia. b. Perhaps because they have moved to another city to be interact and be near to grandchildren. In case a, those lost to follow-up are much sicker than the average subject, and in case b they are most likely the most healthy and active of the group. In a) the KM will over-estimate survival and in b it will underestimate. The main point is that there is absolutely no way to know, other than actually tracking the subjects down. Any study which has a substantial fraction with incomplete follow-up is making a guess. The more accurate phrase would be "a blind hope for independent censoring" than "assume". There are cases where simple reasoning or experience tells me that this hope is futile, but mostly we just hope. The alternative is proactive follow-up, i.e., devote enough staff and resources to actively contact all of the study subjects on a regular schedule. Even then you will lose a few. (In one study several years ago, long term follow-up of cancer, there was a new Mrs Smith who refused to acknowledge the existence of the prior wife, even to forward letters.) Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Updated package: survival_2.40-1
Survival version 2.40 has been relased to CRAN. This is a warning that some users may see changes in results, however. The heart of the issue can be shown with a simple example. Calculate the following simple set of intervals: <>= birth <- as.Date("1973/03/10") start <- as.Date("1998/09/13") + 1:40 end <- as.Date("1998/12/03") + rep(1:10, 4) interval <- (end-start) table(interval) 51 61 71 81 10 10 10 10 @ Each interval has a different start and end date, but there are only 4 unique intervals, each of which appears 10 times. Now convert this to an age scale. <>= start.age <- as.numeric(start-birth)/365.25 end.age <- as.numeric(end -birth)/365.25 age.interval <- end.age - start.age table(match(age.interval, unique(age.interval))) 1 2 3 4 5 6 7 8 9 1 5 5 1 9 7 3 @ There are now eight different age intervals instead of 4, and the 8 unique values appear between 1 and 9 times each. Exact results likely will depend on your computer system. We have become a victim of round off error. Some users prefer to use time in days and some prefer time in years, and those latter users expect, I am sure, survival analysis results to be identical on the two scales. Both the coxph and survfit routines treat tied event times in a special way, and this roundoff can make actual ties appear as non-tied values, however. Parametric survival such as \code{survreg} is not affected by this issue. In survival version 2.40 this issue has been addressed for the coxph and survfit routines; input times are subjected to the same logic found in the all.equal routine in order to determine actual ties. The upshot is that some users may experience a changed results. For the following test case cox1 and cox2 are identical coefficients in version 2.40, but different in prior versions. <<>>= ndata <- data.frame(id=1:30, birth.dt = rep(as.Date("1953/03/10"), 30), enroll.dt= as.Date("1993/03/10") + 1:30, end.dt = as.Date("1996/10/21") + 1:30 + rep(1:10, 3), status= rep(0:1, length=30), x = 1:30) ndata$enroll.age <- with(ndata, as.numeric(enroll.dt - birth.dt))/365.25 ndata$end.age<- with(ndata, as.numeric(end.dt - birth.dt))/365.25 fudays <- with(ndata, as.numeric(end.dt - enroll.dt)) fuyrs <- with(ndata, as.numeric(end.age- enroll.age)) cox1 <- coxph(Surv(fudays, status) ~ x, data=ndata) cox2 <- coxph(Surv(fuyrs, status) ~ x, data=ndata) @ This general issue of floating point precision arises often enough in R that is part of the frequently asked questions, see FAQ 7.31 on CRAN. The author of the survival routines (me) has always used days as the scale for analysis -- just by habit, not for any particluarly good reason -- so the issue had never appeared in my work nor in the survival package's test suite. Due to user input, this issue had been addressed earlier in the survfit routine, but only when the status variable was 0/1, not when it is a factor. As a final footnote, the simple data set above also gives different results when coded in SAS: I am not alone in overlooking it. As a consequence, the maintainer expects to get new emails that ``we have found a bug in your code: it gives a different answer than SAS''. (This is an actual quote.) Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] isssues with predict.coxph, offset, type = "expected", and newdata
I'm off on vacation and checking email only intermittently. Wrt the offset issue, I expect that you are correct. This is not a case that I had ever envisioned, and so was not on my "list" when writing the code and certainly has no test case. That does not mean that it shouldn't work, just that I am not shocked to see it. I will look into this. For the second case of a NULL model I am less sympathetic. This is, in theory, just reading off values from a Nelson hazard estimate at specific time points; using a coxph call to do so is a case of swatting a fly with a hammer. A bit more background might make me more excited about extending the code to this case. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] finegray function in the survival package
On 09/07/2016 05:00 AM, r-help-requ...@r-project.org wrote: Dear R-Team, I have been trying to use the finegray routine that creates a special data so that Fine and Gray model can be fit. However, it does not seem to work. Could you please help me with this issue? Thanks, Ahalya. You have given us no details of your example code that "doesn't work", and I can't read your mind. So no, we can't help. Give us a hint. Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Conditional gap time frailty cox model for recurrent events
You can ignore the message below. The maximizing routine buried within the frailty() command buried with coxph() has a maximizer that is not the brightest. It sometimes gets lost but then finds its way again. The message is from one of those. It likely took a not-so-good update step, and took a couple of iterations to recover. In coxpenal.fit(X, Y, strats, offset, init = init, control, weights = weights, : Inner loop failed to coverge for iterations 3 4 To be fair the maximizing problem for a mixed effects Cox model is not easy. In the coxme code I have spent much more time on the details of this. Terry Therneau -- On 09/06/2016 05:00 AM, r-help-requ...@r-project.org wrote: Dear Elisabetta, I have no direct answer to your question, but a suggestion: Use the 'coxme' function (in the package with the same name). In the help page for 'frailty' (survival) you will find: "The coxme package has superseded this method. It is faster, more stable, and more flexible." Hth, G?ran On 2016-09-05 11:42, Elisabetta Petracci wrote: >Dear users, > >I am fitting a conditional gap time frailty cox model weighting >observations by means of inverse probability time dependent weigths. >Attached find the self-explaining dataset. > >I have used the following sintax: > >coxph(Surv(gaptstart,gaptstop,status)~treat+strata(nrecord01)+frailty(id,distribution="gamma",method="em"), >data=dataNOTDrr,weights=dataNOTDrr$weight) > > >And I get the following warning: > >Warning message: >In coxpenal.fit(X, Y, strats, offset, init = init, control, weights = >weights, : > Inner loop failed to coverge for iterations 3 4 > > >I have tried to: >- leave out the weights but I get the error anyway >- to randomly select a subset of patients and I don't get the error. This >seems to suggest that the problem is with some observations. > >Any suggestion? > >Many thanks, > >Elisabetta > __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] problem concernig Survsplit, package survival
On 08/20/2016 05:00 AM, Vinzenz wrote: For some days I have been struggling with a problem concerning the ?survSplit?-function of the package ?survival?. Searching the internet I have found a pretty good -German- description of Daniel Wollschl?r describing how to use survSplit: The survSplit routine was recently updated to allow a formula as the first argument. This change makes the routine much easier to use and more flexible. Old forms of the call should have worked as well, but unfortunately I introduced a bug in the code. For the time being, change your call to dfSurvCP <- survSplit(Surv(obsT, status) ~ ., dfSurv, cut=seq(30, 90, by=30), id="ID", zero=0) I will fix this. I apologize for the error. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Johannes Hengelbrock <j.hengelbr...@uke.de>
I'm traveling so chasing this down more fully will wait until I get home. Four points. 1. This is an edge case. You will notice that if you add "subset=1:100" to the coxph call that the function works perfectly. You have to get up to 1000 or so before it fails. 2. The exact partial likelihood of Cox is referred to as "exact" by coxph and as "discrete" by SAS phreg. What phreg calls 'exact' is the exact marginal likelihood of Prentice. I don't know which of these you were using in SAS so can't verify that "phreg works". 3. The computations and memory for the exact calculation go up phenomenally with the number of ties. It is a sum over n choose d terms, if there were 10 events out of 200 at risk this is a sum over all ways to choose 10 subjects out of 200 which is approx 2e16 terms. Your example requires all choices of 1541 out of 3000, which I would expect to take somewhere near age-of-the-universe seconds to compute. The code uses a clever nested compuation due to Gail et al which will cut that time down to infinity/10. 4. This example drove coxph into a memory fault, I suspect. I will certainly look into patching that once I get home. (There is a check for this but it must have a flaw). My sympathy is for your plight is low, however. I can't conceive of the real data problem where someone would actually need to compute this awful likelihood. 1541 events tied at the same time? Or even more to imagine a case where I would need it badly enough to wait a lifetime for the answer. The Efron approximation is pretty darn good for cases like this, and it is fast. Terry Therneau --- Dear users, I'm trying to estimate a conditional logistic model using the coxph()-function from the survival package. Somehow, the model does not converge if time is set to the same value for all observations: library(survival) set.seed(12345) n <- 3000 a <- rbinom(n, 1, 0.5) b <- rbinom(n, 1, 0.5) coxph(formula = Surv(rep(1, 3000), a) ~ b, method = "exact") Error in fitter(X, Y, strats, offset, init, control, weights = weights, : NA/NaN/Inf in foreign function call (arg 5) In addition: Warning message: In fitter(X, Y, strats, offset, init, control, weights = weights, :Ran out of iterations and did not converge Changing iter.max does not help, aparently. Strangely, the exact same model converges in SAS. I know that I could estimate the model differently (via glm), but I would like to understand why the model does converge in SAS but not in R. Thanks, Johannes [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] [R-pkgs] Survival 2.39
A new version of the survival package has been released. The biggest change is stronger support for multi-state models, which is an outgrowth of their increasing use in my own practice. Interested users are directed to the "time dependent covariates" vignette for discussion of the tmerge and survSplit functions, which are useful tools to build the requisite data sets, and to the "multi-state" vignette for model fits and plots. Terry Therneau ___ R-packages mailing list r-packa...@r-project.org https://stat.ethz.ch/mailman/listinfo/r-packages __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] simple interactions
I was right that there is an easy answer! Thanks for the 3 quick answers, all three correct and useful. Terry Therneau On 04/15/2016 07:15 AM, Thierry Onkelinx wrote: Dear Terry, Does fitting group + age:group instead of age*group solves your problem? Best regards, ir. Thierry Onkelinx 2016-04-15 13:58 GMT+02:00 Therneau, Terry M., Ph.D. <thern...@mayo.edu <mailto:thern...@mayo.edu>>: I'd like to get interaction terms in a model to be in another form. Namely, suppose I had variables age and group, the latter a factor with levels A, B, C, with age * group in the model. What I would like are the variables "age:group=A", "age:group=B" and "age:group=C" (and group itself of course). The coefficients of the model will then be the age effect in group A, the age effect in group B and the age effect in C rather than the standard ones of an overall age effect followed by contrasts. These is often a better format for tables in a publication. Yes, I can reconstruct these from the original fit, but I have a lot of variables for several models and it would be easier to have an automatic form. I suspect that there is an easy answer, but I don't see it. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] simple interactions
I'd like to get interaction terms in a model to be in another form. Namely, suppose I had variables age and group, the latter a factor with levels A, B, C, with age * group in the model. What I would like are the variables "age:group=A", "age:group=B" and "age:group=C" (and group itself of course). The coefficients of the model will then be the age effect in group A, the age effect in group B and the age effect in C rather than the standard ones of an overall age effect followed by contrasts. These is often a better format for tables in a publication. Yes, I can reconstruct these from the original fit, but I have a lot of variables for several models and it would be easier to have an automatic form. I suspect that there is an easy answer, but I don't see it. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Using R for cURL commands,Message-ID:
On 04/02/2016 05:00 AM, r-help-requ...@r-project.org wrote: Hello, I'm looking for a way in which R can make my live easier. Currently i'm using R convert data from a dataframe to json's and then sending these json's to a rest api using a curl command in the terminal (i'm on a mac). I've been looking for a way to use R for sending data from R to the rest api. My primairy focus was on using R for executing the curl command, however, I'm open to other approaches. The method I've been using so far: - These are lines snipped from a working application. I make use of the httr package. The parameters of the call are assembled as a nested list "xlist" prior to this section. The set of errors I check for is unique to the API I'm talking to, an internal one that's a bit odd in its return codes, but still it gives you the idea. auth <- authenticate(id$lanid, id$password, type="basic") query <- POST(url=control$posturl, body= xlist, auth, encode="json") if (status_code(query) >= 300) handle_reset(control$posturl) if (status_code(query) == 401) stop("invalid lanid/password for this application") else if (status_code(query) == 400) stop("internal error from dart package: 'syntax of request'") else stop_for_status(query) #other query errors, e.g. server down __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Convergence issues when using ns splines (pkg: spline) in Cox model (coxph) even when changing coxph.control
Thanks to David for pointing this out. The "time dependent covariates" vignette in the survival package has a section on time dependent coefficients that talks directly about this issue. In short, the following model is simply wrong: coxph(Surv(time, status) ~ trt + prior + karno + I(karno * log(time)), data=veteran) People try this often as a way to create the time dependent covariate Karnofsky * log(t), which is often put forwards as a way to deal with non-proportional hazards. To do this correctly you have to use the tt() functionality in coxph to move the computation out of the model statement: coxph(Surv(time, status) ~ trt + prior + karno + tt(karno), data=veteran, tt = function(x, time, ...) x*log(time)) BTW the following SAS code is also wrong: proc phreg data=veteran; model time * status(0) = trt + prior + karno* time; SAS does the right thing, however, if you move the computation off the model line. model time * status(0) = trt + karno + zzz; zzz = karno * time; The quote "SAS does it but R fails" comes at me moderately often in this context. The reason is that SAS won't LET you put a phrase like "log(time)" into the model statement, so people end up doing the right thing, but by accident. Terry T. On 03/30/2016 05:28 PM, Göran Broström wrote: On 2016-03-30 23:06, David Winsemius wrote: On Mar 29, 2016, at 1:47 PM, Jennifer Wu, Misswrote: Hi, I am currently using R v3.2.3 and on Windows 10 OS 64Bit. I am having convergence issues when I use coxph with a interaction term (glarg*bca_py) and interaction term with the restricted cubic spline (glarg*bca_time_ns). I use survival and spline package to create the Cox model and cubic splines respectively. Without the interaction term and/or spline, I have no convergence problem. I read some forums about changing the iterations and I have but it did not work. I was just wondering if I am using the inter.max and outer.max appropriately. I read the survival manual, other R-help and stackoverflow pages and it suggested changing the iterations but it doesn't specify what is the max I can go. I ran something similar in SAS and did not run into a convergence problem. This is my code: bca_time_ns <- ns(ins_ca$bca_py, knots=3, Boundary.knots=range(2,5,10)) test <- ins_ca$glarg*ins_ca$bca_py test1 <- ins_ca$glarg*bca_time_ns In your `coxph` call the variable 'bca_py' is the survival time and Right David: I didn't notice that the 'missing main effect' in fact was part of the survival object! And as you say: Time to rethink the whole model. Göran yet here you are constructing not just one but two interactions (one of which is a vector but the other one a matrix) between 'glarg' and your survival times. Is this some sort of effort to identify a violation of proportionality over the course of a study? Broström sagely points out that these interactions are not in the data-object and subsequent efforts to refer to them may be confounded by the multiple environments from which data would be coming into the model. Better to have everything come in from the data-object. The fact that SAS did not have a problem with this rather self-referential or circular model may be a poor reflection on SAS rather than on the survival package. Unlike Therneau or Broström who asked for data, I suggest the problem lies with the model construction and you should be reading what Therneau has written about identification of non-proportionality and identification of time dependence of effects. See Chapter 6 of his "Modeling Survival Data". __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] convergence issues with coxph
Failure to converge in a coxph model is very rare. If the program does not make it in 20 iterations it likely will never converge, so your control argument will do little. Without the data set I have no way to guess what is happening. My first question, however, is to ask how many events you have, e.g. table(bca). I count 19 covariates on the right hand side, and a good rule of thumb is that one should have at least 5- 10 endpoints per covariate for simple numerical stability and 10-20 for statistical stability. That means 100-200 events. Most medical data sets have fewer than this. A data set with 5000 rows and 4 death counts as "4" in this calculation by the way. I am always interested in data sets that push the boundaries of the code and can look deeper if you want to send me a copy. Details of how things are coded can matter, e.g., centered covariates. Otherwise there is little we can do. Terry Therneau On 03/30/2016 05:00 AM, r-help-requ...@r-project.org wrote: I am having convergence issues when I use coxph with a interaction term (glarg*bca_py) and interaction term with the restricted cubic spline (glarg*bca_time_ns). I use survival and spline package to create the Cox model and cubic splines respectively. Without the interaction term and/or spline, I have no convergence problem. I read some forums about changing the iterations and I have but it did not work. I was just wondering if I am using the inter.max and outer.max appropriately. I read the survival manual, other R-help and stackoverflow pages and it suggested changing the iterations but it doesn't specify what is the max I can go. I ran something similar in SAS and did not run into a convergence problem. This is my code: bca_time_ns <- ns(ins_ca$bca_py, knots=3, Boundary.knots=range(2,5,10)) test <- ins_ca$glarg*ins_ca$bca_py test1 <- ins_ca$glarg*bca_time_ns coxit <- coxph.control(iter.max=1, outer.max=1) bca<-coxph(Surv(bca_py,bca) ~ glarg + test + test1 + age + calyr + diab_dur + hba1c + adm_met + adm_sulfo + adm_tzd + adm_oth + med_statin + med_aspirin + med_nsaids + bmi_cat + ccscat + alc + smk, data=ins_ca, control=coxit, ties=c("breslow")) __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] plotting spline term when frailty term is included
On 03/02/2016 05:00 AM, r-help-requ...@r-project.org wrote: I'd very much appreciate your help in resolving a problem that I'm having with plotting a spline term. I have a Cox PH model including a smoothing spline and a frailty term as follows: fit<-coxph(Surv(start,end,exit) ~ x + pspline(z) + frailty(a)) When I run a model without a frailty term, I use the following in order to plot the splines: termplot(fit, term = 2, se = TRUE, ylab = "Log Hazard", rug=TRUE, xlab = "z_name") However, when the frailty term is included, it gives this error: Error in pred[, first] : subscript out of bounds What am I doing wrong here? Or is it the case that termplot does not work when splines and frailty are included? There are 3 parts to the answer. 1. The first is a warning: wrt to mixed effects Cox models, I shifted my energy to coxme over 10 years ago. The "penalized add on to coxph" approach of the frailty function was an ok first pass, but is just too limited for any but the simplest models. I'm unlikely fix issues, since there are others much higher on my priority list. 2. As Dave W. pointed out, the key issue is that predict(type='terms') does not work with for a model with two penalized terms, when one is frailty and the other pspline. Termplot depends on predict. 3. Again, as Dave W pointed out, the whole issue of what the "correct" answer would be gets much more complicated when one adds random effects to the mix; some things have not done because it is not clear where to go. (Survival curves for a mixed effects model only recently got added to my "todo" list, even though it has been on the wish list forever, because I finally have a notion of what a good approach would be.) In your case I'd advise an end run: fit the model using ns() instead of pspline. I like smoothing splines better than regression splines, but the fact is that for most data sets they result in nearly identical answers. Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Estimating Mean of a Poisson Distribution Based on interval censoring
For an interval censored poisson or lognormal, use survreg() in the survival package. (Or if you are a SAS fan use proc lifereg). If you have a data set where R and SAS give different answers I'd like to know about it, but my general experience is that this is more often a user error. I am also curious to learn exactly what you mean by "interval censored poisson". Exponential with interval time to first event is equivalent to poisson, which is what I'd guess from "lognormal", but you may mean something else. Terry Therneau (author of survival) On 02/14/2016 05:00 AM, r-help-requ...@r-project.org wrote: Dear all, I appreciate that if you let me know if there is any package implemented in R for Estimating Mean of a Poisson Distribution Based on Interval censoring? And if yes, could you please provide some information about it:)? By the way, is there anything for lognormal?I think fitdistcens is not good for this purpose as it gives me different result compared to SAS and only useful for right/left censoring and not interval censoring?(or both left and right together).? Kind regards,Mohsen __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Survival::coxph (clogit), survConcordance vs. summary(fit) concordance
I read the digest form which puts me behind, plus the last 2 days have been solid meetings with an external advisory group so I missed the initial query. Three responses. 1. The clogit routine sets the data up properly and then calls a stratified Cox model. If you want the survConcordance routine to give the same answer, it also needs to know about the strata survConcordance (Surv(rep(1, 76L), resp) ~ predict(fit) + strata(ID), data=dat) I'm not surprised that you get a very different answer with/without strata. 2. I've never thought of using a robust variance for the matched case/control model. I'm having a hard time wrapping my head around what you would expect that to accomplish (statistically). Subjects are already matched on someone from the same site, so where does a per-site effect creep in? Assuming there is a good reason and I just don't see it (not an unwarranted assumption), I'm not aware of any work on what an appropriate variance would be for the concordance in that case. 3. I need to think about the large variance issue. Terry Therneau On 01/20/2016 08:09 PM, r-help-requ...@r-project.org wrote: Hi, I'm running conditional logistic regression with survival::clogit. I have "1-1 case-control" data, i.e., there is 1 case and 1 control in each strata. Model: fit <- clogit(resp ~ x1 + x2, strata(ID), cluster(site), method ="efron", data = dat) Where resp is 1's and 0's, and x1 and x2 are both continuous. Predictors are both significant. A snippet of summary(fit): Concordance= 0.763 (se = 0.5 ) Rsquare= 0.304 (max possible= 0.5 ) Likelihood ratio test= 27.54 on 2 df, p=1.047e-06 Wald test= 17.19 on 2 df, p=0.0001853 Score (logrank) test = 17.43 on 2 df, p=0.0001644, Robust = 6.66 p=0.03574 The concordance estimate seems good but the SE is HUGE. I get a very different estimate from the survConcordance function, which I know says computes concordance for a "single continuous covariate", but it runs on my model with 2 continuous covariates survConcordance(Surv(rep(1, 76L), resp) ~ predict(fit), dat) n= 76 Concordance= 0.9106648 se= 0.09365047 concordant discordant tied.risk tied.timestd(c-d) 1315. 129. 0. 703. 270.4626 Are both of these concordance estimates valid but providing different information? Is one more appropriate for measuring "performance" (in the AUC sense) of conditional logistic models? Is it possible that the HUGE SE estimate represents a convergence problem (no warnings were thrown when fit the model), or is this model just useless? Thanks! __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] clogit and weights
How should the weights be treated? If they are multiple observation weights (a weight of "3" is shorthand for 3 subjects) that leads to a different likelihood than sampling weights ("3" means to give this one subject more influence). The clogit command can't read your mind and so has chosen not to make a guess. Also, please do not post in html. As you see below it leads to a mangled message. Terry Therneau On 12/22/2015 05:00 AM, r-help-requ...@r-project.org wrote: Merry Christmas everyone: I have the following data(mydat) and would like to fit a conditional logistic regression model considering "weights". id? case?exposure?? weights 1?1?1? 2 1?0?0? 2 2?1?1? 2 2?0?0? 2 3?1?1? 1 3?0?0? 1 4?1?0 ?2 4?0?1? 2 ?The R function"clogit" is for such purposes but it ignores weights.?I tried function"mclogit" instead which seems that it considers the weights option:##options(scipen=999)library(mclogit)# create the above data frameid? = c(1,1,2,2,3,3,4,4)case? =?c(1,0,1,0,1,0,1,0)exposure = c(1,0,1,0,1,0,0,1)weights? = c(2,2,2,2,1,1,2,2)(mydata??= data.frame(id,case,exposure,weights)) fit??= mclogit(cbind(case,id) ~ exposure,weights=weights, data=mydata)summary(fit)## The answer,however,?doesn't seem to be?correct. Could anyone?pleaseprovides me with some solution to this??Thanks in advance,Keramat Nourijelyani,PhD?? [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Survival analysis: ERROR: Time and status are different lengths
I expect that reading the result of print(fit.weib) will answer your question. If there were any missing values in the data set, then the fit.weib$linear.predictors will be shorter than the original data set, and the printout will have a note about "...deleted due to missing". The simplest solution to this is to set options(na.action="na.exclude") before doing the fit. Then predict(fit) and resid(fit) will return vectors of the same length as the input data, containing NA in the appropriate positions. The default na.action of "na.omit" leaves missing out of both the fit and the residuals. (Unfortunately, only a few modeling functions in R pay attention to the difference between these two na.action options.) Terry Therneau On 12/04/2015 05:00 AM, r-help-requ...@r-project.org wrote: Hi, I am fitting an AFT model assuming a Weibull distribution and I would like to check the residuals compared to the Kaplan Meier residuals, but when I try to create the Kaplan Meier residuals I get an error: Time and status are different lengths. I am using the following script: # Fitting the AFT model fit.weib <- survreg(Surv(TimeDeath, event) ~ age + sex + mutation + ventilation + BM1 + BM2, data = DF, dist = "weibull") fits <- fit.weib$linear.predictors resids <- (fit.weib$y[, 1] - fits)/fit.weib$scale resKM <- survfit(Surv(resids, event) ~ 1, data = DF) I get the error from the last line of the script. I tried some things that didn't work and I was wondering if anyone could help me. If you need more information please let me know. Thanks in advance, __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] SWEAVE - a gentle introduction
As a digest reader I am late to the discussion, but let me toss in 2 further notes. 1. Three advantages of knitr over Sweave a. The book "Dynamic documents with R and knitr". It is well written; sitting down for an evening with the first half (70 pages) is a pretty good way to learn the package. The second half covers features you may or may not use over time. My only complaint about the book is that it needs a longer index; I have had several cases of "I know I read about xxx, but where was it?". But this is ameliorated by b. Good online resources: manuals, tutorials, and question/answer pairs on various lists. c. Ongoing support. Sweave is static. 2. Latex vs markdown (knitr supports both) One can choose "latex style" or "markdown style" for writing your documents. I know latex very well (I wrote my book using it) but recommend markdown to all others in my department. The latter is about 1/3 the learning curve. Markdown produces very nice output, latex goes the extra mile to produce true book quality. But one rarely needs that extra polish, and even more rarely needs it enough to justify the extra learning cost. I still use the latex form myself as it is not at all difficult to use --- once you learn it. Terry Therneau On 11/18/2015 05:00 AM, r-help-requ...@r-project.org wrote: I am looking for a gentle introduction to SWEAVE, and would appreciate recommendations. I have an R program that I want to run and have the output and plots in one document. I believe this can be accomplished with SWEAVE. Unfortunately I don't know HTML, but am willing to learn. . . as I said I need a gentle introduction to SWEAVE. Thank you, John __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Error survreg: Density function returned an an invalid matrix
The error message states that there is an invalid value for the density. A long stretch of code is not very helpful in understanding this. What we need are the definition of your density -- as it would be written in a textbook. This formula needs to give a valid response for the range -infinity to +infinity. Or more precisely, for any value that the maximizer might guess at some point during the iteration. Terry T. On 11/14/2015 05:00 AM, r-help-requ...@r-project.org wrote: Thanks Terry but the error persists. See: >library(foreign)> library(survival)> library(VGAM) > mypareto <- list(name='Pareto',+ init= remainder of message trucated __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] recursive partitioning in R
Look at the rpart vignette "User written split functions". The code allows you to add your own splitting method to the code (in R, no C required). This has proven to be very useful for trying out new ideas. The second piece would be to do your own cross-validation. That is, turn off the built in cross-validation using the xval=0 option, then explicitly do the cross-validation yourself. Fit a new tree to some chosen subset of data, using your split rule of course, and then use predict() to get predicted values for the remaining observations. Again, this is all in R, and you can explicitly control your in or out of bag subsets. The xpred.rpart function may be useful to automate some of the steps. If you look up rpart on CRAN, you will see a link to the package source. If you were to read the C source code you will discover that 95% is boring bookkeeping of what observations are in what part(s) of the tree, sorting the data, tracking missing values, etc. If you ever do want to write your own code you are more than welcome to build off this --- I wouldn't want to write that part again. Terry Therneau On 11/12/2015 05:00 AM, r-help-requ...@r-project.org wrote: Dear List, I'd like to make a few modifications to the typical CART algorithm, and I'd rather not code the whole thing from scratch. Specifically I want to use different in-sample and out-of-sample fit criteria in the split choosing and cross-validation stages. I see however that the code for CART in both the rpart and the tree packages is written in C. Two questions: * Where is the C code? It might be possible to get a C-fluent programmer to help me with this. * Is there any code for CART that is written entirely in R? Thanks, Andrew __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Error survreg: Density function returned an an invalid matrix
Hi, I want to perform a survival analysis using survreg procedure from survival library in R for a pareto distribution for a time variable, so I set the new distribution using the following sintax: library(foreign) library(survival) library(VGAM) mypareto <- list(name='Pareto', init= function(x, weights,parms){ etc. The survreg routine fits location-scale distributions such that (t(y) - Xb)/s ~ F, where t is an optional transformation, F is some fixed distribution and X is a matrix of covariates. For any distribution the questions to ask before trying to add the distribution to survreg are - can it be written in a location-scale form? - if so, how do the parameters of the distribution map to the location (Xb) and scale (s). In fitting data we normally have per-subject location (X b) but an intercept-only model is of course possible. If y is Weibull then log(y) fits into the framework, which is how survreg fits it. The transformation of parameters location and scale parameters for log(y) back to the usual Weibull parameterization for y often trips people up (see comments in the Examples section of ?survreg). The log of a Pareto can be written in this form (I think?). The two parameters are the scale a and lower limit b, with survival function of S(x)= (b/x)^a, for x >= b. If y = log(x) the survival function for y is S(y) = (b/exp(y))^a = exp[-(y - log(b))/(1/a)], which has location log(b) and scale 1/a. But even if I am correct the discontinuity at b will cause the underlying Newton-Raphson method to fail. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coxme adequacy check
On 10/28/2015 06:00 AM, r-help-requ...@r-project.org wrote: Hello all! I?m fitting a mixed effects cox model with coxme function of coxme package. I want to konw what is the best way to check the model adequacy, once that function cox.zph that does not work for coxme objects. Thanks in advanced, Raoni No one has done the theory work to see if the method used by coxme extends to random effects models. I suspect that it does, but that is a long way from a proof. So at this point I don't have a good suggestion, even though I have been thinking about the problem. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fitting a curve to weibull distribution in R using nls
On 10/14/2015 05:00 AM, r-help-requ...@r-project.org wrote: I am trying to fit this data to a weibull distribution: My y variable is:1 1 1 4 7 20 7 14 19 15 18 3 4 1 3 1 1 1 1 1 1 1 1 1 and x variable is:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 One could always use existing R functions that fit a Weibull regression, instead of reinventing the wheel. library(survival) y <- scan() 1 1 1 4 7 20 7 14 19 15 18 3 4 1 3 1 1 1 1 1 1 1 1 1 wfit <- survreg(Surv(1:24) ~ 1, weights=y, dist="weibull") wfit Call: survreg(formula = Surv(1:24) ~ 1, weights = y, dist = "weibull") Coefficients: (Intercept) 2.352121 Scale= 0.4130924 Loglik(model)= -351.4 Loglik(intercept only)= -351.4 n= 24 zz <- seq(0, 25, length=100) plot(zz, dsurvreg(zz, 2.352121, 0.4130924), col=2, type='l', ylim=c(0, .15), xlab="Value", ylab="Density") points(1:24, y/sum(y)) - There are a half dozen ways to parameterize a Weibull distribution; the location-scale form used by survreg is one of the less common. See help(survreg) for more information -- look at the example near the bottom of the page. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] rpart cutpoint interpretation
The cutpoint is on the predictor, so the interpretation is the same as it is for any other rpart model. The subjects with predictor < cutpoint form one group and those > cutpoint the other. The cutpoint is chosen to give the greatest difference in "average y" between the groups. For poisson "averge y" is an event rate. On 10/08/2015 05:00 AM, r-help-requ...@r-project.org wrote: I am trying to derive cutpoint/threshold with a poisson distributed dependent variable. I know how to interpret cutpoint with binary dependent variable based on direction. Can some on help me to intrepret cutpoint for poisson case with one independent variable with the derived threshold. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] flatten a list
I'd like to flatten a list from 2 levels to 1 level. This has to be easy, but is currently opaque to me. temp <- list(1:3, list(letters[1:3], duh= 5:8), zed=15:17) Desired result would be a 4 element list. [[1]] 1:3 [[2]] "a", "b", "c" [[duh]] 5:8 [[zed]] 15:17 (Preservation of the names is not important) Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] XPT files
This was an FDA/SAS bargain a long while ago. SAS made the XPT format publicly available and unchanging in return for it becoming a standard for submission. Many packages can reliably read or write these files. (The same is not true for other SAS file formats, nor is xport the SAS default.) I do not know how good the R routines are, having never used them. The following snippit is taken from http://www.fda.gov/downloads/ForIndustry/DataStandards/StudyDataStandards/UCM312964.pdf 2 Dataset Specifications 2.1 File Format SAS XPORT transport file format, also called Version 5 SAS transport format, is an open format published by the SAS Institute. The description of this SAS transport file format is in the public domain. Data can be translated to and from this SAS transport format to other commonly usedformats without the use of programs from SAS Institute or any specific vendor. Sponsors can find the record layout for SAS XPORT transport files through SAS technical support technical document TS-140. This document and additional information about the SAS Transport file layout can be found on the SAS World Wide Web page at http://www.sas.com/fda-esub. --- Said document TS-140 talks about IBM 360 and Dec VAX machines but no others, which should give you an idea of its age. Terry Therneau On 09/27/2015 05:00 AM, r-help-requ...@r-project.org wrote: Peter Thanks for the explanation. One further comment ? you wrote: >I don't think the FDA "requests" XPT files In fact, they do make such a request. Here is the actual language received this week (and repeatedly in the past): >Program/script files should be submitted using text files (*.TXT) and the data should be submitted using SAS transport files (*.XPT). Dennis __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] retaining characters in a csv file
Thanks for all for the comments, I hadn't intended to start a war. My summary: 1. Most important: I wasn't missing something obvious. This is always my first suspicion when I submit something to R-help, and it's true more often than not. 2. Obviously (at least it is now), the CSV standard does not specify that quotes should force a character result. R is not "wrong". Wrt to using what Excel does as litmus test, I consider that to be totally uninformative about standards: neither pro (like Duncan) or anti (like Rolf), but simply irrelevant. (Like many MS choices.) 3. I'll have to code in my own solution, either pre-scan the first few lines to create a colClasses, or use read_csv from the readr library (if there are leading zeros it keeps the string as character, which may suffice for my needs), or something else. 4. The source of the data is a "text/csv" field coming from an http POST request. This is an internal service on an internal Mayo server and coded by our own IT department; this will not be the first case where I have found that their definition of "csv" is not quite standard. Terry T. On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote: I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", . Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Accounting for correlated random effects in coxme
I've been away for a couple weeks and am now catching up on email. The issue is that the coxme code does not have conversions built-in for all of the possible types of sparse matrix. Since it assumes that the variance matrix must be symmetric, the non-neccarily-symmetric dgCMatrix class is not one that I had considered. You should transform it to dsCMatrix first, which is a symmetric class. Or if it is small enough, to a simple matrix. Terry T. On 09/22/2015 05:00 AM, r-help-requ...@r-project.org wrote: I have a problem with running the mixed effects Cox regression model using a distance matrix from a phylogeny rather than a pedigree. I searched previous posts and didn't find any directly relevant previous posts. I am interested in using a mixed effects Cox regression model to determine the best predictors of time to recruitment in 80 different reintroduced plant populations representing a total of 31 species. I will like to account for correlated random effects that result from phylogenetic relationships amongst species. Dr. Therneau's 2015 article on Mixed Effects Cox Models provide a very helpful template for me to do this with the coxme function in R. In this article, the correlation structure due to genetic relationships amongst individuals was defined using a kinship matrix derived from a pedigree. Instead of a pedigree, I have a phylogeny for these 31 species. Hence, I used the inverseA function in the MCMCglmm package to generate an inverse additive genetic relatedness matrix from the phylogeny for these 31 species. And then fed it in as input to the varlist argument in my mixed effects cox regression model (using function coxme). I got an error message (please see below). Based on the error, one thought I had was to convert the inverseA matrix from a ?dgCMatrix? to ?bdsmatrix? but this was not successful either. I have also unsuccessfully tried to use a pairwise phylogenetic distance matrix. Is there a better way to do this? I basically just want to account for the correlated random effects due to phylogenetic relatedness amongst the 31 species represented in the dataset for the Cox regression model. Please see my code below and I welcome suggestions on how best to make this work. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] retaining characters in a csv file
I have a csv file from an automatic process (so this will happen thousands of times), for which the first row is a vector of variable names and the second row often starts something like this: 5724550,"000202075214",2005.02.17,2005.02.17,"F", . Notice the second variable which is a character string (note the quotation marks) a sequence of numeric digits leading zeros are significant The read.csv function insists on turning this into a numeric. Is there any simple set of options that will turn this behavior off? I'm looking for a way to tell it to "obey the bloody quotes" -- I still want the first, third, etc columns to become numeric. There can be more than one variable like this, and not always in the second position. This happens deep inside the httr library; there is an easy way for me to add more options to the read.csv call but it is not so easy to replace it with something else. Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] using survreg() in survival package with "long" data
On 08/30/2015 05:00 AM, r-help-requ...@r-project.org wrote: I'm unable to fit a parametric survival regression using survreg() in the survival package with data in "counting-process" ("long") form. To illustrate using a scaled-down problem with 10 subjects (with data placed on the web): As usual I'm a day late since I read digests, and Goran has already clarified things. A discussion of this is badly needed in my as yet unwrritten book on using the survival package. From a higher level view: If an observation is interval censored (a,b) then one knows that the event happened between time "a" and time "b", but not when. The survreg routine can handle interval censored data since it is parametric (you need to integrate over the interval). The interval (-infinity, b) is called 'left censored' and the interval (a, infinity) is 'right censored'. Left censored data is rare in medical work, an example might be a chronic disease like rhuematoid arthritis where we know that the true disease onset was some time before the date it was first detected, and one is trying to deduce the duration of disease. Left truncation at time 'a' means that any events before time "a" are not in the data set. In a referral center like mine this includes any subjects who die before they come to us. The coxph model handles left truncation naturally via its counting process formulation. That same formulation also allows it to deal with time dependent covariates. Accelerated failure time models like survreg can handle left truncation in principle, but they require that the values of any covariates are known from time 0 -- even for a truncated subject. I have never added left-truncation to the survreg code, mostly because I have never needed it myself, but also because users would immediately think that they could accomplish time-dependent covariates by simply using a long format data set. Rather, each subject needs to be linked to a full covariate history, which is a bit more work. So: coxph does left truncation but not left (or interval) censoring survreg does interval censoring but not left truncation (or time dependent covariates). Terry T __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Survival analysis and predict time-to-death
I read this list a day late as a digest so my answers are rarely the first. (Which is nice as David W answers most of the survival questions for me!) What you are asking is reasonable, and in fact is common practice in the realm of industrial reliability, e.g., Meeker and Escobar, Statistical Methods for Reliability Analysis. Extrapolation of the survival curve to obtain the mean and percentiles of the lifetime distribution for some device (e.g. a washing machine) is their bread and butter, used for instance to determine the right size for an inventory of spare parts. For most of us on this list who do medical statistics and live in the Kaplan-Meier/ Cox model world the ideas are uncommon. I was lucky enough to sit through one of Bill Meeker's short courses and retain some (minimal) memory of it. 1. You are correct that parametric models are essential. If the extrapolation is substantial (30% or more censored, say), then the choice of distribution can be critical. If failure is due to repeated insult, e.g., the multi-hit model, then Weibull tends to be preferred; if it is from degradation, e.g., flexing of a diaphram, then the log-normal. Beyond this you need more guidance than mine. 2. The survreg routine assumes that log(y) ~ covariates + error. For a log-normal distribion the error is Gaussian and thus the predict(fit, type='response') will be exp(predicted mean of log time), which is not the predicted mean time. For Weibull the error dist is asymmetric so things are more muddy. Each is the MLE prediction for the subject, just not interpretable as a mean. To get the actual mean you need to look up the formulas for Weibull and/or lognormal in a textbook, and map from the survreg parameterization to whatever one the textbook uses. The two parameterizations are never the same. 3. Another option is predicted quantiles. ?predict.survreg shows how to get the entire survival curve. The mean can be obtained as the area under the survival curve. Relevant to your question, the expected time remaining for a subject still alive at time =10, say, is integral(S(t), from 10 to infin) / S(10), where S is the survival curve. You can also read off quantiles of the expected remaining life. Terry Therneau (author of the survival package) On 08/18/2015 05:00 AM, r-help-requ...@r-project.org wrote: Dear All, I would like to build a model, based on survival analysis on some data, that is able to predict the /*expected time until death*/ for a new data instance. Data For each individual in the population I have the, for each unit of time, the status information and several continuous covariates for that particular time. The data is right censored since at the end of the time interval analyzed, instances could be still alive and die later. Model I created the model using R and the survreg function: lfit - survreg(Surv(time, status) ~ X) where: - time is the time vector - status is the status vector (0 alive, 1 death) - X is a bind of multiple vectors of covariates Predict time to death Given a new individual with some covariates values, I would like to predict the estimated time to death. In other words, the number of time units for which the individual will be still alive till his death. I think I can use this: ptime - predict(lfit, newdata=data.frame(X=NEWDATA), type='response') Is that correct? Am I going to get the expected-time-to-death that I would like to have? In theory, I could provide also the time information (the time when the individual has those covariates values), should I simply add that in the newdata: ptime - predict(lfit, newdata=data.frame(time=TIME, X=NEWDATA), type='response') Is that correct? Is this going to improve the prediction? (for my data, the time already passed should be an important variable). Any other suggestions or comments? Thank you! __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Differences in output of lme() when introducing interactions
The following are in parody (but like all good parody correct wrt the salient features). The musings of Guernsey McPhearson http://www.senns.demon.co.uk/wprose.html#Mixed http://www.senns.demon.co.uk/wprose.html#FDA In formal publication: Senn, Statistical Issues in Drug Development, second edition, Chapter 14: Multicentre Trials Senn, The many modes of meta, Drug information journal, 34:535-549, 2000. The second points out that in a meta analysis no one would ever consider giving both large and small trials equal weights, and relates that to several other bits of standard practice. The 'equal weights' notion embedded in a fixed effects model + SAS type 3 is an isolated backwater. Terry T. PS. The Devils' Drug Development Dictionary at the same source has some gems. Three rather random choices: Bayesian - One who, vaguely expecting a horse and catching a glimpse of a donkey, strongly concludes he has seen a mule. Medical Statistician - One who won't accept that Columbus discovered America because he said he was looking for India in the trial Plan. Trend Towards Significance - An ever present help in times of trouble. On 07/22/2015 06:02 PM, Rolf Turner wrote: On 23/07/15 01:15, Therneau, Terry M., Ph.D. wrote: SNIP 3. Should you ever use it [i.e. Type III SS]? No. There is a very strong inverse correlation between understand what it really is and recommend its use. Stephen Senn has written very intellgently on the issues. Terry --- can you please supply an explicit citation? Ta. cheers, Rolf __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Differences in output of lme() when introducing interactions
Type III is a peculiarity of SAS, which has taken root in the world. There are 3 main questions wrt to it: 1. How to compute it (outside of SAS). There is a trick using contr.treatment coding that works if the design has no missing factor combinations, your post has a link to such a description. The SAS documentation is very obtuse, thus almost no one knows how to compute the general case. 2. What is it? It is a population average. The predicted average treatment effect in a balanced population-- one where all the factor combinations appeared the same number of times. One way to compute 'type 3' is to create such a data set, get all the predicted values, and then take the average prediction for treatment A, average for treatment B, average for C, ... and test are these averages the same. The algorithm of #1 above leads to another explanation which is a false trail, in my opinion. 3. Should you ever use it? No. There is a very strong inverse correlation between understand what it really is and recommend its use. Stephen Senn has written very intellgently on the issues. Terry Therneau On 07/22/2015 05:00 AM, r-help-requ...@r-project.org wrote: Dear Michael, thanks a lot. I am studying the marginality and I came across to this post: http://www.ats.ucla.edu/stat/r/faq/type3.htm Do you think that the procedure there described is the right one to solve my problem? Would you have any other online resources to suggest especially dealing with R? My department does not have a statician, so I have to find a solution with my own capacities. Thanks in advance Angelo __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] matrix manipulation
This is as much a mathematics as an R question, in the this should be easy but I don't see it category. Assume I have a full rank p by p matrix V (aside: V = (X'X)^{-1} for a particular setup), a p by k matrix B, and I want to complete an orthagonal basis for the space with distance function V. That is, find A such that t(B) %*% V %*% A =0, where A has p rows and p-k columns. With V=identity this is easy. I can do it in 1-2 lines using qr(), lm(), or several other tools. A part of me is quite certain that the general problem isn't more than 3 lines of R, but after a day of beating my head on the issue I still don't see it. Math wise it looks like a simple homework problem in a mid level class, but I'm not currently sure that I'd pass said class. If someone could show the way I would be grateful. Either that or assurance that the problem actually IS hard and I'm not as dense as I think. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] matrix manipulation -solved
Yes it is obvious --- once someone else pointed it out. Thanks for the hint. Terry T. On 07/16/2015 12:52 PM, Peter Langfelder wrote: Hi Terry, maybe I'm missing something, but why not define a matrix BB = V'B; then t(B) %*% V = t(BB), then your problem reduces to finding A such that t(BB) %*% A = 0? Peter On Thu, Jul 16, 2015 at 10:28 AM, Therneau, Terry M., Ph.D. thern...@mayo.edu wrote: This is as much a mathematics as an R question, in the this should be easy but I don't see it category. Assume I have a full rank p by p matrix V (aside: V = (X'X)^{-1} for a particular setup), a p by k matrix B, and I want to complete an orthagonal basis for the space with distance function V. That is, find A such that t(B) %*% V %*% A =0, where A has p rows and p-k columns. With V=identity this is easy. I can do it in 1-2 lines using qr(), lm(), or several other tools. A part of me is quite certain that the general problem isn't more than 3 lines of R, but after a day of beating my head on the issue I still don't see it. Math wise it looks like a simple homework problem in a mid level class, but I'm not currently sure that I'd pass said class. If someone could show the way I would be grateful. Either that or assurance that the problem actually IS hard and I'm not as dense as I think. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Variance estimates for survreg vs. lm
The difference is that survreg is using a maximum likelihood estimate (MLE) of the variance and that lm is using the unbiased (MVUE) estimate of variance. For simple linear regression, the former divides by n and the latter by n-p. The difference in your variances is exactly n/(n-p) = 10/8. With censored data the MLE estimate is still clear, but what exactly n is is not so obvious (does a censored datum count as a whole observation?), so a simple (n-p)/n variance correction is also not obvious. I would not be surprised if someone, somewhere has hammered out a correction for unbiased variance; I've never looked for it. I'm not convinced that it is worthwhile though. Terry Therneau On 07/04/2015 05:00 AM, r-help-requ...@r-project.org wrote: I would like help understanding why a survival regression with no censored data-points does not give the same variance estimates as a linear model (see code below). I think it must be something to do with the fact that the variance is an actual parameter in the survival version via the log(scale), and possibly that different assumptions are made about the distribution of the variance. But I really don't know, I'm just guessing. The reason I ask is because I am moving a process, that has always been modelled using a linear model, to a survival model (because there are sometimes a few censored data points). In the past, the censored data points have been treated as missing which imparts bias. The variance of the estimates in this process is key, so I need to know why they are changing in this systematic way?! library(survival) ctl - c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) ctl.surv - Surv(ctl) trt - c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) lmod - lm (ctl ~ trt) smod - survreg(ctl.surv ~ trt,dist=gaussian) coef(lmod) coef(smod) # same vcov(lmod) vcov(smod) # smod is smaller diag(vcov(lmod)) / diag(vcov(smod))[1:2] # 1.25 == 0.5*(n/(n-1)) ( summary(lmod)$coef [ ,Std. Error] / summary(smod)$table[1:2,Std. Error] )^2# 1.25 = 0.5*(n/(n-1)) __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] prmatrix and print.matrix
The help page for prmatrix states that it only exists for backwards compatability and strongly hints at using print.matrix instead. However, there does not seem to be a print.matrix() function. The help page for print mentions a zero.print option, but that does not appear to affect matrices. This (or something like it) is what I was looking for. Am I overlooking something? Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Different behavior of model.matrix between R 3.2 and R3.1.1
Frank, I don't think there is any way to fix your problem except the way that I did it. library(survival) tdata - data.frame(y=c(1,3,3,5, 5,7, 7,9, 9,13), x1=factor(letters[c(1,1,1,1,1,2,2,2,2,2)]), x2= c(1,2,1,2,1,2,1,2,1,2)) fit1 - lm( y ~ x1 * strata(x2) - strata(x2), tdata) coef(fit1) (Intercept)x1b x1a:strata(x2)x2=2 x1b:strata(x2)x2=2 3.00 5.00 1.00 1.67 Your code is calling model.matrix with the same model frame and terms structure as the lm call above (I checked). In your case you know that the underlying model has 2 intercepts (strata), one for the group with x2=1 and another for the group with x2=2, but how is the model.matrix routine supposed to guess that? It can't, so model.matrix returns the proper result for the lm call. As seen above the result is not singular, while for the Cox model it is singular due to the extra intercept. This is simply an extension of leaving the intercept term in the model and then removing that column from the returned X matrix, which is necessary to have the correct coding for ordinary factor variables, something we've both done since day 1. In order for model.matrix to do the right thing with interactions, it has to know how many intercepts there actually are. I've come to the conclusion that the entire thrust of 'contrasts' in S was wrong headed, i.e., the remove redundant columns from the X matrix ahead of time logic. It is simply not possible for the model.matrix routine to guess correctly for all y and x combinations, something that been acknowledged in R by changing the default for singular.ok to TRUE. Dealing with this after the fact via a good contrast function (a la SAS -- heresy!) would have been a much better design choice. But as long as I'm in R the coxph routine tries to be a good citizen. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Different behavior of model.matrix between R 3.2 and R3.1.1
Frank, I'm not sure what is going on. The following test function works for me in both 3.1.1 and 3.2, i.e, the second model matrix has fewer columns. As I indicated to you earlier, the coxph code removes the strata() columns after creating X because I found it easier to correctly create the assign attribute. Can you create a worked example? require(survival) testfun - function(formula, data) { tform - terms(formula, specials=strata) mf - model.frame(tform, data) terms2 - terms(mf) strat - untangle.specials(terms2, strata) if (length(strat$terms)) terms2 - terms2[-strat$terms] X - model.matrix(terms2, mf) X } tdata - data.frame(y= 1:10, zed = 1:10, grp = factor(c(1,1,1,2,2,2,1,1,3,3))) testfun(y ~ zed*grp, tdata) testfun(y ~ strata(grp)*zed, tdata) Terry T. - original message -- For building design matrices for Cox proportional hazards models in the cph function in the rms package I have always used this construct: Terms - terms(formula, specials=c(strat, cluster, strata), data=data) specials - attr(Terms, 'specials') stra- specials$strat Terms.ns - Terms if(length(stra)) { temp - untangle.specials(Terms.ns, strat, 1) Terms.ns - Terms.ns[- temp$terms]#uses [.terms function } X - model.matrix(Terms.ns, X)[, -1, drop=FALSE] The Terms.ns logic removes stratification factor main effects so that if a stratification factor interacts with a non-stratification factor, only the interaction terms are included, not the strat. factor main effects. [In a Cox PH model stratification goes into the nonparametric survival curve part of the model]. Lately this logic quit working; model.matrix keeps the unneeded main effects in the design matrix. Does anyone know what changed in R that could have caused this, and possibly a workaround? --- __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Unwanted unicode - solved
Thanks to all. The suggestion below by Marc Schwarz (the second I tried) showed the problem. One of the references in one of the files had been pasted in and had a funky dash in its 111-196 page number. It looked just fine in my emacs window so I hadn't picked it up. There are 90 .Rd files so this saved me substantial time. Terry T. On 06/04/2015 03:00 PM, Marc Schwartz wrote: On Jun 4, 2015, at 12:56 PM, Therneau, Terry M., Ph.D. thern...@mayo.edu wrote: I'm checking the survival package and get the following error. How do I find the offending line? (There are a LOT of files in the man directory.) Terry T. -- * checking PDF version of manual ... WARNING LaTeX errors when creating PDF version. This typically indicates Rd problems. LaTeX errors found: ! Package inputenc Error: Unicode char \u8:‑ not set up for use with LaTeX. Terry, One possible option: require(tools) sapply(list.files(path = “Path/To/Your/RD/Files, pattern = .Rd), showNonASCIIfile) See ?list.files and ?showNonASCIIfile Regards, Marc Schwartz __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Unwanted unicode
I'm checking the survival package and get the following error. How do I find the offending line? (There are a LOT of files in the man directory.) Terry T. -- * checking PDF version of manual ... WARNING LaTeX errors when creating PDF version. This typically indicates Rd problems. LaTeX errors found: ! Package inputenc Error: Unicode char \u8:‑ not set up for use with LaTeX. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Logistic regression R and Stata grouping variable
You were not completely clear, but it appears that you have data where each subject has results from 8 trials, as a pair of variables is changed. If that is correct, then you want to have a variance that corrects for the repeated measures. In R the glm command handles the simple case but not the repeated measures one. Statisticially you can use a generalized estimating equations approach (package gee) or a random effect per subject approach (lme or lmer package). Terry T. On 05/27/2015 05:00 AM, r-help-requ...@r-project.org wrote: I mostly use Stata 13 for my regression analysis. I want to conduct a logistic regression on a proportion/number of success. Because I receive errors in Stata I did not expect nor understand (if there are Stata experts who want to know more about the problems I face and can potentially help me solve them, I would be glad to give more details), I want to repeat the analysis in R. In Stata I would use the command: xtlogit DEP_PROP INDEP_A INDEP_B INDEP_C, i(ID). ID is the identifier for each subject. There are eight lines with data for each subject because there are three within factors (INDEP_A, B, C) with two levels each (0 and 1). I can repeat this analysis in R by using the command: glm(DEP_SUC ~ INDEP_A + INDEP_B + INDEP_C, family = ?binomial?). DEP_SUC is here a table with the successes and misses per row. Again, there are eight rows for each subject. But while I know how to group these lines in Stata by using the option i(ID ), I do not know what to do in R. I have se! ar! ch for more information about the i() command, but did not find any usefull information. So, to summarize: I want to find out how three variables (binary) influence a proportion and use logistic regression. In Stata I can group multiple lines per subject using the i( ) command in logistic regression. What is the equivalent in R? Thank you in advance! __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Help with pedigree() function in kinship2
Your problem is that PatientID, FatherID, MotherID are factors. The authors of kinship2 (myself and Jason) simply never thought of someone doing this. Yes, that is an oversight. We will correct it by adding some more checks and balances. For now, turn your id variables into character or numeric. Terry Therneau On 05/13/2015 05:00 AM, r-help-requ...@r-project.org wrote: Dear R-help, I am interested in plotting some pedigrees and came across the kinship2 package. What follows is an example of the pedigrees I am working with Now, when running ## check package availability if(!require(kinship2)) install.packages('kinship2') require(kinship2) ## data to plot d - structure(list(FamilyID = c(1, 1, 1, 1, 1, 1, 1, 1, 1), PatientID = structure(c(2L, 3L, 5L, 11L, 12L, 15L, 16L, 17L, 6L), .Label = c( 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 18, 20, 23, 24, 25, 27, 28, 29, 30, 31, 33, 34, 35, 37, 38, 39, 41, 43, 45, 50, 62, 63, 64, 65, 66, 67, 85, 88), class = factor), FatherID = structure(c(1L, 1L, 6L, 1L, 5L, 6L, 1L, 7L, 6L), .Label = c(0, 1, 10, 11, 13, 2, 23, 27, 28, 3, 33, 34, 35, 38, 5, 62, 64, 66, 9), class = factor), MotherID = structure(c(1L, 1L, 7L, 1L, 14L, 7L, 1L, 5L, 7L), .Label = c(0, 10, 18, 2, 24, 29, 3, 30, 33, 34, 39, 4, 43, 5, 6, 63, 65, 9), class = factor), Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c(Female, Male), class = factor), AffectionStatus = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c(1, 2), class = factor)), .Names = c(FamilyID, PatientID, FatherID, MotherID, Sex, AffectionStatus ), row.names = c(NA, 9L), class = data.frame) ## plotting ped - with(d, pedigree(PatientID, FatherID, MotherID, Sex, affected = AffectionStatus, famid = FamilyID)) ## Error in pedigree(PatientID, FatherID, MotherID, Sex, affected = AffectionStatus, : ##Value of 'dadid' not found in the id list 1/0 1/0 1/2 1/0 1/2 I get an error. My sessionInfo() is at the end. I was wondering if someone could help me to dissect what the cause of this error is and how it can be fixed. Thank you very much for your help. Best regards, Jorge Velez.- __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Predict in glmnet for Cox family
On 04/21/2015 05:00 AM, r-help-requ...@r-project.org wrote: Dear All, I am in some difficulty with predicting 'expected time of survival' for each observation for a glmnet cox family with LASSO. I have two dataset 5 * 450 (obs * Var) and 8000 * 450 (obs * var), I considered first one as train and second one as test. I got the predict output and I am bit lost here, pre - predict(fit,type=response, newx =selectedVar[1:20,]) s0 1 0.9454985 2 0.6684135 3 0.5941740 4 0.5241938 5 0.5376783 This is the output I am getting - I understood with type response gives the fitted relative-risk for cox family. I would like to know how I can convert it or change the fitted relative-risk to 'expected time of survival' ? Any help would be great, thanks for all your time and effort. Sincerely, The answer is that you cannot predict survival time, in general. The reason is that most studies do not follow the subjects for a sufficiently long time. For instance, say that the data set comes from a study that enrolled subjects and then followed them for up to 5 years, at which time 35% had experienced mortality (using the usual Kaplan-Meier). Fit a model to the data and ask what is the predicted survival time for a low risk subject. The answer will at best be greater than 5 years. The program cannot say if it is 6 or 10 or even 1000. A bigger data set does not help. Terry Therneau __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Coxme penalized log-likelihood mismatch
The perils of backwards compatability During computation the important quantity is loglik + penalty. That is what is contained in the third element of the loglik vector. Originally that is also what was printed, but I later realized that for statistical inference one wants the loglik and penalty to be separate, e.g. the AIC is based on the loglik and the degrees of freedom. So I changed the printout to give loglik at beta=0, integrated loglik at solution, ordinary loglik at solution (Integrated and ordinary are the same at beta=0). But, because I was worried that some people would have been reading the results in a program (like you are), I didn't want to break their code so left the internal variable alone. It appears that I traded maybe safer for certain confusion. Terry Therneau On 04/09/2015 05:00 AM, r-help-requ...@r-project.org wrote: Hi,? I need to extract the penalized log-likehood term from coxme objects but I find the values stored whitin the object different than the penalized term given in the summary output of coxme function. Both the Null and the Integrated values are identical but the penalized is always off.? Any thoughts on why and how i can extract the right value to compute the AIC myself? (I know an AIC value is given in the output but I need to compute it myself inside a loop)? Many thanks,? Alexandre Lafontaine __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Error in thielsen
I have no idea. A data set that generates the error would be very helpful to me. What is the role of the last line BTW, the one with 1% on it? Looking at the code I would guess that the vector tied has an NA in it, but how that would happen I can't see. There is a reasonable chance that it will become obvious once I have an example. Terry Therneau On 03/06/2015 05:00 AM, r-help-requ...@r-project.org wrote: I don?t understand an error message from a thielsen function call within a dplyr do function call. by.CaseNo - group_by(as.data.frame(MAP.dt), CaseNo) MAP.thielsen - by.CaseNo %% + do(model = thielsen(noninvMAP ~ invMAP, symmetric = T, data = ., + x = T, y = T, model = T, nboot = 1000)) | | 1% ~40 m remaining Error in if (any(tied)) { : missing value where TRUE/FALSE needed __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Sorting Surv objects
Your work around is not as easy looking to me. Survival times come in multiple flavors: left censored, right censored, interval censored, left-truncated and right censored, and multi-state. Can you give me guidance on how each of these should sort? If a sort method is added to the package it needs to deal with all of these. Professor Ripley has pointed out that the default action of sort() for right censored times, which I agree is reasonable. Terry Therneau (author of the survival package) On 02/13/2015 05:00 AM, r-help-requ...@r-project.org wrote: It seems that Surv objects do not sort correctly. This seems to be a bug. Anyone else found this? survival.data [1] 4+ 3 1+ 2 5+ class(survival.data) [1] Surv sort(survival.data) [1] 2 1+ 4+ 3 5+ An easy work-around is to define a function sort.Surv __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to read this Rpart decision tree
First: summary(ss.rpart1) or summary(ss.rpart, file=whatever) The printout will be quite long since your tree is so large, so the second form may be best followed by a perusal of the file with your favorite text editor. The file name of whatever above should be something you choose, of course. This will give you a full description of the tree. Read the first node or two very carefully so that you understand what the fit did. Plotting routines for trees have to make display choices, since there simply is not enough space available to list all the details. You have a complicated endpoint with at least 14 different products. The predicted value for the each node of the tree is a vector of percentages (one per product, adds to one); plots often show only the name of the most frequent. The alive/dead endpoint for the Titanic data is a lot easier to fit into a little plotting oval so of course the plotted tree is easier to grasp. Terry T. On 02/11/2015 05:00 AM, r-help-requ...@r-project.org wrote: Hi all, In the attachment or this link (http://oi58.tinypic.com/35ic9qc.jpg) you'll find the decision tree I made. I used the Rpart package to make the tree and the rattle package using the fancyRpartPlot to plot it. The data in the tree looks different than about every example I have seen before. I don't understand how I should read it. I want to predict Product (which are productkeys). The variables to predict it contain age, incomegroup, gender, totalchildren, education, occupation, houseownerflag, numberCars.It looks like the upper number is a ProductKey. The n is number of observations? And the percentage of the yes/no question below. This is the code I used. ss.rpart1 - rpart(Product ~ ., data=sstrain, control=rpart.control(minbucket=2,minsplit=1, cp=-1)) spt - which.min(ss.rpart1$cptable[, xerror]) scp - ss.rpart1$cptable[opt, CP] ss.rpart2 - prune(ss.rpart1, cp=cp) fancyRpartPlot(ss.rpart2) So why does the tree looks so different from the most (for example:http://media.tumblr.com/a9f482ff88b0b9cfaffca7ffd46c6a8e/tumblr_inline_mz7pyuaYJQ1s5wtly.png). This is from Trevor Stephen's TItanic tutorial. The first node show that 62% of 100% doesn't survive. If they were male, only 19% of them were survivors. I find that a lot examples look like that. Why does mine predict per ProductKey and every node it has something else. it doesn't make sense to me. And it doesn't have the two numbers like .62 and .38 but it has n=197e+3. So should I read the first node like For 100% of the observations of ProductKey 1074, the incomegroup was moderate)? Thank you! Kim __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R example codes for direct standardization of rates
The pyears() and survexp() routines in the survival package are designed for these calculations. See the technical report #63 of the Mayo Biostat group for examples http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biomedical-statistics-infomatics/technical-reports http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biomedical-statistics-informatics/technical-reports Terry Therneau -- begin included message --- I am looking for R example codes to compute age-standardized death rates by smoking and psychological distress status using person-years of observation created from the National Health Interview Survey Linked Mortality Files. Any help with the example codes or references will be appreciated. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] half-logistic distribution
On 12/26/2014 05:00 AM, r-help-requ...@r-project.org wrote: i want to analyse survival data using typeI HALF LOGISTIC DISTRIBUTION.how can i go about it?it installed one on R in the survival package didn't include the distribution...or i need a code to use maximum likelihood to estimate the parameter in survival analysis.a typical example of distribution other than that installed in R will help.thanks I am the author of the survival package, and had never heard of the type I half-logistic before. New distributions can be added to survreg() if they can be represented as location-scale families; I don't think that your distribution can be written in that way. Look at survival under the Task Views tab (upper left corner) of the cran.org web page -- someone else may have already done it. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cox model with multiple events - PH assumption
On 12/23/2014 05:00 AM, r-help-requ...@r-project.org wrote: Dear all, I'm using the package survival for adjusting the Cox model with multiple events (Prentice, Williams and Peterson Model). I have several covariates, some of them are time-dependent. I'm using the functioncox.zph to check the proportional hazards. Due to the nature of the time-dependent covariates, I don't need to analyse the assumptions of the proportional hazards associated with the time-dependent covariates. Is it right? ?Thanks for your attention. Best regards, Helena. Wrong. The PH assumption is that the same coefficient b for a covariate applies over all follow-up time. The fact that the covariate itself changes value, or does not change value, over time has no bearing on whether the assumption is true. Now it may often be the case that risk is related to current covariate values, and a Cox model using baseline values fails just because the covariates are out of date. So PH might hold for the updated (time-dependent) covariates and fail when using baseline values. This is a very study specific situation, however. Terry T. __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] exclude missing co-variable data in cox model
Three responses to your question 1. Missing values in R are denoted by NA. When reading in your data you want to use the na.strings option so that the internal form of the data has missing values properly denoted. 2. If this is done, then coxme will notice the missings and remove them, you do not need to do anything. Your second question of how to use the missing data is a much deeper statistical one. Multiple imputation would be the obvious way to proceed, but it is complex and I have no experience with respect to how it would work with a mixed effects Cox model. 3. Your note implies that output was attached. Note that r-help strips away all attachments, so none of us saw it. Terry Therneau On 12/19/2014 05:00 AM, r-help-requ...@r-project.org wrote: Hi all, I have a data set like this: Test.cox file: V1V2 V3Survival Event ann 13 WTHomo 41 ben 20 *51 tom 40 Variant 61 where * indicates that I don't know what the value is for V3 for Ben. Remainder of message excluded __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to pass a nested random effect with frailty in coxph()
Use the coxme funtion (package coxme), which has the same syntax as lme4. The frailty() function in coxph only handles the simple case of a random intercept. Terry Therneau On 12/12/2014 05:00 AM, r-help-requ...@r-project.org wrote: Hi, I have a very simple Cox regression model in which I need to include a nested random effect: individual nested in treatment. I know how to pass a single random effect - e.g. frailty(id)- but how can I specify the nested random (id nested in treatment) effect using frailty? The equivalent in lme4 would be (treatment|id) Thanks! David __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] CMD check error
I have a new package (local use only). R CMD check fails with a messge I haven't seen before, and I haven't been able to guess the cause. There are two vignettes, both of which have %\VignetteIndexEntry lines. Same failure both under R-3.1.1 and R-devel, so it's me and not R. Linux OS. Hints anyone? Terry Therneau = tmt% R CMD build dart * preparing 'dart': * checking DESCRIPTION meta-information ... OK * installing the package to build vignettes * creating vignettes ... OK * checking for LF line-endings in source and make files * checking for empty or unneeded directories * looking to see if a 'data/datalist' file should be added * building 'dart_1.0-2.tar.gz' tmt% R CMD check dart*gz ... Installation failed. See '/people/biostat2/therneau/consult/bsi/dart.Rcheck ... tmt% more dart.Rcheck/00install.out ... ** installing vignettes Warning in file(con, w) : cannot open file '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart/doc/ index.html': No such file or directory Error in file(con, w) : cannot open the connection ERROR: installing vignettes failed * removing '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart' [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] CMD check error
[Picture of a hand slapping forehead] I suspected it to be something either dumb or obvious, and managed both. This library plugs into the tail of a long developement project for a web API, which had prior documentation, which I stuck in doc. And then added to .Rbuildignore. Thanks Hadley Terry T. On 11/18/2014 08:47 AM, Hadley Wickham wrote: Do you have a .Rbuildignore? If so, what's in it? Hadley On Tue, Nov 18, 2014 at 7:07 AM, Therneau, Terry M., Ph.D. thern...@mayo.edu wrote: I have a new package (local use only). R CMD check fails with a messge I haven't seen before, and I haven't been able to guess the cause. There are two vignettes, both of which have %\VignetteIndexEntry lines. Same failure both under R-3.1.1 and R-devel, so it's me and not R. Linux OS. Hints anyone? Terry Therneau = tmt% R CMD build dart * preparing 'dart': * checking DESCRIPTION meta-information ... OK * installing the package to build vignettes * creating vignettes ... OK * checking for LF line-endings in source and make files * checking for empty or unneeded directories * looking to see if a 'data/datalist' file should be added * building 'dart_1.0-2.tar.gz' tmt% R CMD check dart*gz ... Installation failed. See '/people/biostat2/therneau/consult/bsi/dart.Rcheck ... tmt% more dart.Rcheck/00install.out ... ** installing vignettes Warning in file(con, w) : cannot open file '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart/doc/ index.html': No such file or directory Error in file(con, w) : cannot open the connection ERROR: installing vignettes failed * removing '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart' [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] cluster + tt terms in coxph
This is fixed in version 2.37-8 of the survival package, which has been in my send to CRAN real-soon-now queue for 6 months. Your note is a prod to get it done. I've been updating and adding vignettes. Terry Therneau On 11/05/2014 05:00 AM, r-help-requ...@r-project.org wrote: I am receiving the following error when trying to include both tt (time transforms) and frailty terms in coxph coxph(Surv(time, status) ~ ph.ecog + tt(age)+cluster(sex), data=lung, + tt=function(x,t,...) pspline(x + t/365.25)) Error in residuals.coxph(fit2, type = dfbeta, collapse = cluster, weighted = TRUE) : Wrong length for 'collapse' I tried both 64 bit (R.3.1.0) and 32 bit (R.3.1.2) in Windows 7 64bit and get the same errors Inclusion of tt and cluster terms worked fine in R2.9.2-2.15.1 under Windows Vista 32 bit and Ubuntu 64 bit Any ideas? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] merge by time, certain value if 5 min before and after an event
I've attached two functions used locally. (The attachments will be stripped off of the r-help response, but the questioner should get them). The functions neardate and tmerge were written to deal with a query that comes up very often in our medical statistics work, some variety of get the closest creatinine value to the subject's date of rehospitalization, at least one week before but no more than 1 year prior, or tasks that merge two data sets to create a single (start, stop] style one. The neardate function is a variant on match(). Given two (id, date) pairs it will find the first pair in list 2 that has date2 = date1 (or =) and the same id. The second variable can be any orderable class, but dates are the most common use and hence the name. These are being added to the survival package release that should be out real-soon-now, once I add some extended examples of their use to the time dependent covariates vignette. Terry Therneau On 10/03/2014 05:00 AM, r-help-requ...@r-project.org wrote: Hello! I hope someone can help me. It would save me days of work. Thanks in advance! I have two dataframes which look like these: myframe - data.frame (Timestamp=c(24.09.2012 09:00:00, 24.09.2012 10:00:00, 24.09.2012 11:00:00), Event=c(low,high,low) ) myframe mydata - data.frame ( Timestamp=c(24.09.2012 09:05:01, 24.09.2012 09:49:50, 24.09.2012 09:51:01, 24.09.2012 10:04:50, 24.09.2012 10:05:10) , location=c(1,2,3,1,5) ) mydata # I want to merge them by time so I have a dataframe which looks like this in the end (i.e. Low during 5 min before and after high ) result - data.frame ( Timestamp=c(24.09.2012 09:05:01, 24.09.2012 09:49:50, 24.09.2012 09:51:01, 24.09.2012 10:04:50, 24.09.2012 10:05:10) , location=c(1,2,3,1,5) , Event=c(low, low,high,high,low)) result Anyone knows how do merge them? Best regards, Dagmar # # Create a nearest date index # date1: the trial date # date2: target to match to # # result: an index vector for data set 1, which shows the row in data set # 2 that has the same id, and the best date. # # best = after The closest date in #2 that is on or after the date in #1 #prior The closest date in #2 that is on or before the date in #1 # neardate - function(date1, date2, id1, id2, best=c(after, prior)) { if (!missing(id1)) { if (length(id1) != length(date1)) stop(id1 and date1 have different lengths) if (missing(id2)) stop(either both or neither of id1 and id2 must be supplied) if (class(id1) != class(id2) !(is.numeric(id1) is.numeric(id2))) stop(id1 and id2 are different data types) if (length(id2) != length(date2)) stop(id2 and date2 have different lengths) } else if (!missing(id2)) stop(either both or neither of id1 and id2 must be supplied) if (class(date1) != class(date2) !(is.numeric(date1) is.numeric(date2))) stop(date1 and date2 are diffrent data types) if (is.factor(date1) !is.ordered(date1)) stop(date1 cannot be a factor) date1 - as.numeric(date1) # this will fail for silly inputs like a list date2 - as.numeric(date2) best - match.arg(best) # Throw out missing dates in the second arg, but remember which ones rowid - 1:length(date2) if (any(is.na(date2))) { toss - is.na(date2) date2 - date2[!toss] if (!missing(id2)) id2 - id2[!toss] rowid - rowid[!toss] } n2 - length(date2) if (n2 ==0) stop(No valid entries in data set 2) # Simpler case: no id variables rowid - 1:length(date2) if (missing(id1)) { if (best==prior) indx2 - approx(date2, 1:n2, date1, method=constant, yleft=NA, yright=n2, rule=2, f=0)$y else indx2 - approx(date2, 1:n2, date1, method=constant, yleft=1, yright=NA, rule=2, f=1)$y return(rowid[indx2]) } # match id values as well # First toss out any rows in id2 that are not possible targets for id1 # (id2 is usually the larger data set, thinning speeds it up) indx1 - match(id2, id1) toss - is.na(indx1) if (any(toss)) { id2 - id2[!toss] date2 - date2[!toss] indx1 - indx1[!toss] rowid - rowid[!toss] } n2 - length(date2) if (n2 ==0) stop(No valid entries in data set 2) # We need to create a merging id. A minimal amount of # spread for the dates keeps numeric overflow at bay alldate - sort(unique(c(date1, date2))) date1 - match(date1, alldate) date2 - match(date2, alldate) delta - 1.0 + length(alldate) #numeric, not integer, on purpose hash1 - match(id1, id1)*delta + date1 hash2 - indx1*delta + date2 if (best==prior) indx2 - approx(hash2, 1:n2, hash1, method=constant, yleft=NA, yright=n2, rule=2, f=0)$y else indx2 - approx(hash2, 1:n2, hash1,
[R] c() and dates
I'm a bit puzzled by a certain behavior with dates. (R version 3.1.1) temp1 - as.Date(1:2, origin=2000/5/3) temp1 [1] 2000-05-04 2000-05-05 temp2 - as.POSIXct(temp1) temp2 [1] 2000-05-03 19:00:00 CDT 2000-05-04 19:00:00 CDT So far so good. On 5/4, midnight in Greenwich it was 19:00 on 5/3 in my time zone. The manual page has a clear explanation of what goes on. c(temp1, temp2) [1] 2000-05-042000-05-052623237-10-15 2623474-05-06 class(c(temp1, temp2)) [1] Date c(temp2, temp1) [1] 2000-05-03 19:00:00 CDT 2000-05-04 19:00:00 CDT [3] 1969-12-31 21:04:41 CST 1969-12-31 21:04:42 CST class(c(temp2, temp1)) [1] POSIXct POSIXt I would have expected c() to determine a common class, somehow, then do the conversion and concatonate. That is obviously not what happens. I've read the manual page but I must be missing something. I make no claim that R is broken, mistaken, or otherwise deficient, only that my understanding is so. Could someone illuminate? Terry T. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] c() and dates
Well duh -- type c.Date at the command prompt to see what is going on. I suspected I was being dense. Now that the behaior is clear can I follow up on David W's comment that redfining the c.Date function as structure(c(unlist(lapply(list(...), as.Date))), class = Date) allows for a more intellegent response, since it allows all of the as.Date machinery to be brought into play. It seems like a good idea in general. Would it be a good exchange between the current nonsense result, no warning and the new error messages that would arise, e.g., from c(as.Date(2000/10/1), factor('b')). Terry T. On 10/03/2014 09:52 AM, peter dalgaard wrote: S3 only has single dispatch, so in one case it dispatches to c.Date and in the other to c.POSIXct, both of those return an object of the corresponding class. In both cases, the arguments pass through c(unlist(lapply(list(...), unclass))) which doesn't look at the class at all. Since Date objects unclass to days and POSIXct to seconds, something is bound to go wrong. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] CoxME: Family relatedness
I would have caught this tomorrow (I read the digest). Some thoughts: 1. Skip the entire step of subsetting the death.kmat object. The coxme function knows how to do this on its own, and is more likely to get it correct. My version of your code would be deathdat.kmat - 2* with(deathdat, makekinship(famid, id, faid, moid)) model3 - coxme(Surv(Survival, Event) ~ Sex + strata(cohort) + SNP1 + SNP2 + SNP3 + (1|id), data=death.dat, varlist=deathdat.kmat) This all assumes that the id variable is unique. If family 3 and family 4 both have an id of 1, then the coxme call can't match up rows in the data to rows/cols in the kinship matrix uniquely. But that is simple to fix. The kinship matrix K has .5 on the diagonal, by definition, but when used as a correlation most folks prefer to use 2K. (This causes mixups since some software adds the 2 for you, but coxme does not.) 2. The model above is the correct covariance structure for a set of families. There is a single intercept per subject, with a complex correlation matrix. The simpler per family frailty model would be model4 - coxme(Surv(Survival, Event) ~ Sex + strata(cohort) + SNP1 + SNP2 + SNP3 + (1|famid), death.dat) This model lets each family have a separate risk, with everyone in the same family sharing the exact same risk. It is less general than model3 above which lets a family have higher risk plus has variation between family members. A model with both per-subject and per family terms is identical to one with a covariance matrix of s1 K + s2 B, where K is the kinship matrix, B is a block diagonal matrix which has a solid block of 1 for each family, and s1 s2 are the fitted variance coefficients. I don't find this intuitive, but have seen the argument that B is a shared environmental effect. (Perhaps because I have large family trees where they do not all live together). If you want such a model: model5 - coxme(.. + (1|id) + (1|famid), death.dat, varlist=deathdat.kmat) (When the varlist is too short the program uses the default for remaining terms). 3. To go further you will need to tell us what you are trying to model, as math formulas not as R code. 4. The error messages you got would more properly read I'm confused on the part of the program. They cases of something I would never do, so never got that message. Therefore useful for me to see. Continuous variables go to the left of the | and categoricals to the right of the |. Having a family id to the left makes no sense at all. Terry Therneau On 09/15/2014 03:20 PM, Marie Dogherty wrote: Dr. Therneau, I was wondering if you had a spare minute, if you could view a post in the R forum: http://r.789695.n4.nabble.com/CoxME-family-relatedness-td4696976.html I would appreciate it, I'm stuck and out of ideas! Many thanks Marie. ---Original post -- Hello all, I have a table like this, with ~300 individuals: Famid Id Faid Moid Cohort Sex Survival Event SNP1 SNP2 SNP3 11000010 1010 22001120 1000 23000025 1010 45120035 1011 46120035 0101 famid=family id, id=id of person,faid=father id,moid=mother id. My question of interest: What impact does SNP1/SNP2/SNP3 have on survival of individuals (Id), after accounting for possible effects due to family relatedness (Famid). So I want to account for family-specific frailty effects, and individual frailty effects according to degree of relationship between individuals. The commands I've used are from this paper: http://www.ncbi.nlm.nih.gov/pubmed/21786277 Library(survival) Library(coxme) Library(kinship2) Library(bdsmatrix) Death.dat -read.table(“Table”,header=T) #Make a kinship matrix for the whole study deathdat.kmat -makekinship(famid=death.dat$famid,id=death.dat$id,father=death.dat$faid,mother=death.dat$moid) ##omit linker individuals with no phenotypic data, used only to ##properly specify the pedigree when calculating kinship ##coefficints: death.dat1-subset(death.dat,!is.na(Survival)) ##temp is an array with the indices of the individuals with Survival years: all -dimnames(deathdat.kmat)[[1]] temp -which(!is.na(death.dat$Survival[match(all,death.dat$id)])) ##kinship matrix for the subset with phenotype Survival: deathdat1.kmat -deathdat.kmat[temp,temp] If I type: model3 -coxme(Surv(Survival,Event)~Sex+strata(Cohort)+SNP1+SNP2+SNP3+(1+id|famid),data=death.dat,varlist=deathdat1.kmat) I get: “Error in coxme(Surv(Survival, Event) ~ Sex + strata(Cohort) + SNP1 + : In random term 1: Mlist cannot have both covariates and grouping” Whereas if I type: model3 -coxme(Surv(Survival,Event)~Sex+strata(Cohort)+SNP1+SNP2+SNP3,(id|famid),data=death.dat1,varlist=deathdat1.kmat) I get: Error in coxme(Surv(Survival, Event) ~ Sex +
Re: [R] Linear relative rate / excess relative risk models
On 07/30/2014 05:00 AM, r-help-requ...@r-project.org wrote: A while ago, I inquired about fitting excess relative risk models in R. This is a follow-up about what I ended up doing in case the question pops up again. While I was not successful in using standard tools, switching to Bayesian modeling using rstan (mc-stan.org/rstan.html) worked better. The results closely match those from Epicure. Using the data here:http://dwoll.de/err/dat.txt The stan model fit below replicates the results from Epicure here:http://dwoll.de/err/epicure.log Of course I am still interested in learning about other options or approaches. Daniel A good paper on issues and solutions is Lumley, Kronmal and Ma, 2006, Relative Risk Regression in Medical Research: Models, Contrasts, Estimators, and Algorithms __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] plot for survreg and intcox (rewritten)
I missed this question. 1. For survreg. help(predict.survreg) shows an example of drawing a survival curve Adding a survfit method has been on my list for a long time, since it would make this information easier to find. 2. intcox. I had not been familiar with this function. Even though the authors have decided to label the result with a coxph class, presumably so the coxph print function will work, almost none of the coxph methods will work on the result. However, reading their manual page it seems that the hazard and its time points are returned as a part of the result. I tried running an example to see if the result is a hazard or a cumulative hazard but the run failed and it's time to run to a meeting. Assume it is cumulative hazard. Then exp(-fit$lambda0) will give the baseline survival curve. Terry T On 08/27/2014 02:47 PM, David Winsemius wrote: On Aug 26, 2014, at 3:34 PM, Silong Liao wrote: Hello, My data is interval censoring, and I use intcox package which deals with this situation. I guess survest is only for survfit, and I would like to plot parametric model survreg.Since I use dist=weibull, I tried curve function but plot seems unreasonable. mod.reg1=survreg(s_new~type+sex+eye+age+preopiop+preopva,dist=weibull) intercept=-19.155 scale=8.12 curve(pweibull(x,scale=exp(coef(mod.reg1)),shape=1/mod.reg1$scale,lower.tail=FALSE),from=0,to=40) You are asked to reply to the list. ( I am taking the liberty of forwarding to Terry Therneau since I don't want to misquote him.) I remember Terry Therneau saying that plotting survival curves from interval-censored data seemed difficult at best and nonsensical at worst. (If it's difficult for him it's likely to be impossible for me.) I don't know if the assumption of a parametric form might aid in that effort. I also remember warnings (in both the documentation and repeated many times by Therneau on rhelp) that the parameters of the weibull distribution used in the survival package were different than those used in the stats package weibull function. -- David. Subject: Re: [R] plot for survreg and intcox (rewritten) From:dwinsem...@comcast.net Date: Tue, 26 Aug 2014 14:57:16 -0700 CC:r-help@r-project.org To:silong.l...@hotmail.com On Aug 26, 2014, at 2:33 PM, Silong Liao wrote: Dear R users, I'm trying to plot survival probability against time(in years) using survreg and intcox. Please can you help me with this problem? (I have rewritten using plain text.) I tried to use curve function but have no clue. I suspect you want survfit (in the survial package which is where I suspect survreg is coming from. It returns an object that has a plot method. You could also scroll through help(pack=survival) to see other plotting functions. You could also use survest in the rms package. For survreg, mod.reg1=survreg(s_new~type+sex+eye+preopiop+preopva,dist=weibull) summary(mod.reg1) Call: survreg(formula = s_new ~ type + sex + eye + preopiop + preopva, dist = weibull) Value Std. Error z p (Intercept) 40.539 20.582 1.970 4.89e-02 typeTrab -6.606 4.279 -1.544 1.23e-01 sexM -1.055 3.765 -0.280 7.79e-01 eyeR -2.112 3.587 -0.589 5.56e-01 preopiop -0.308 0.269 -1.147 2.52e-01 preopva -0.461 1.771 -0.260 7.95e-01 Log(scale) 2.058 0.285 7.222 5.12e-13 Scale= 7.83 Weibull distribution Loglik(model)= -78.7 Loglik(intercept only)= -81.4 Chisq= 5.37 on 5 degrees of freedom, p= 0.37 Number of Newton-Raphson Iterations: 10 n= 339 For intcox, You are asked to provide the package name for functions that are not in the base or default packages. I have quite a few packages loaded including survival_2.37-7 , coxme_2.2-3, and rms_4.2-0 but I get: ?intcox No documentation for ‘intcox’ in specified packages and libraries: you could try ‘??intcox’ -- David. cox.fit=intcox(s_new~type+eye+sex+age+preopiop+preopva,data=glaucoma_new) summary(cox.fit) Call: intcox(formula = s_new ~ type + eye + sex + age + preopiop +preopva, data = glaucoma_new) n= 339 coef exp(coef) se(coef) z Pr(|z|) typeTrab 0.59391 1.81106 NA NA NA eyeR 0.28419 1.32868 NA NA NA sexM -0.11597 0.89050 NA NA NA age -0.06556 0.93655 NA NA NA preopiop 0.03903 1.03980 NA NA NA preopva -0.05517 0.94632 NA NA NA exp(coef) exp(-coef) lower .95 upper .95 typeTrab 1.8111 0.5522 NA NA eyeR 1.3287 0.7526 NA NA sexM 0.8905 1.1230 NA NA age 0.9365 1.0678 NA NA preopiop 1.0398 0.9617 NA NA preopva 0.9463 1.0567 NA NA Rsquare= NA (max possible= 0.327 ) Likelihood ratio test= NA on 6 df, p=NA Wald test = NA on 6 df, p=NA Score (logrank) test = NA on 6 df, p=NA __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. David Winsemius Alameda, CA, USA Rplot.jpeg
Re: [R] Cox regression model for matched data with replacement
On 08/13/2014 05:00 AM, John Purda wrote: I am curious about this problem as well. How do you go about creating the weights for each pair, and are you suggesting that we can just incorporate a weight statement in the model as opposed to the strata statement? And Dr. Therneau, let's say I have 140 cases matched with replacement to 2 controls. Is my id variable the number of cases? The above has an incorrect assumption that I notice ALL survival questions on the list -- which was false in this case. Could you clue me in as to the original question and discussion -- assuming that you want Dr Therneau to respond intelligently :-) Terry T. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cox regression model for matched data with replacement
Ok, I will try to do a short tutorial answer. 1. The score statistic for a Cox model is a sum of (x - xbar), where x is the covariate vector of the subject who had an event, and xbar is the mean covariate vector for the population, at that event time. - the usual Cox model uses the mean of {everyone still at risk} as xbar - matched Cox models use a mean of {some subset of those at risk}, and work fine as long as that subset is an honest estimate of xbar. You do, of course, have to sample from those still at risk at the time point, since that is the xbar you are trying to estimate. Someone who dies or is censored at time 10 can't be a control at time 20. - in an ordinary Cox model the program figures out who belongs in each xbar average all on its own, using the time variable. In a matched model you need to supply the who dances with who information. The usual way is to assign each of the sets {subject who died + their controls} to a separate stratum. (If there is only one death in each stratum then the time variable will not be needed and you can plug in a dummy value; this is what clogit does.) You can have more than one control per case by the way. 2. Variance. In the matched model you run the risk, a quite small risk, that the same person would be picked again and again as the control. If this unfortunate thing were to happen then the usual model based variance would be too optimistic --- because of its overdependence on one single subject the fit is more unstable than it looks. Three solutions: a) don't worry about it (my usual approach), b) when selecting controls, ensure that this doesn't happen (classic matched case control), c) use a robust variance. For the latter make sure that each subject in the data set has a unique value for some variable id and add + cluster(id) to the model statement. 3. The most common mistake in matching is to exclude, at a given death time t, any subject with a future event from the list of potential controls at time t. This does not lead to an unbiased estimate of xbar, and the resulting numerical bias in the coefficients is shockingly large. There are more clever ways to pick the subset at each event time, e.g., if you had some prior information on all the subjects that can classify them into high/medium/low risk. Survey sampling principles come into play for selection and the xbar at each time is replaced with an appropriate weighted survey estimate. See various papers by Brian Langholz. Terry T On 08/13/2014 07:26 AM, John Pura wrote: Hi Dr. Therneau, The original question on the forum was: My problem was how to build a Cox model for the matched data (1:n) with replacement. Usually, we can use stratified Cox regression model when the data were matched without replacement. However, if the data were matched with replacement, due to the re-use of subjects, we should give a weight for each pair, then how to incorporate this weight into a Cox model. I also checked the clogit function in survival package, it seems suitable to the logistic model for the matched data with replacement, rather than Cox model. Because it sets the time to a constant. Anyone can give me some suggestions? I’m facing a very similar situation, in which I have multiple controls to multiple cases. How would I go about taking that dependency into account in a Cox model? Is this weighting appropriate and to get robust sandwich estimates, can I take my id variable to be the id for the unique cases? Thanks, John On 08/13/2014 05:00 AM, John Purda wrote: I am curious about this problem as well. How do you go about creating the weights for each pair, and are you suggesting that we can just incorporate a weight statement in the model as opposed to the strata statement? And Dr. Therneau, let's say I have 140 cases matched with replacement to 2 controls. Is my id variable the number of cases? The above has an incorrect assumption that I notice ALL survival questions on the list -- which was false in this case. Could you clue me in as to the original question and discussion -- assuming that you want Dr Therneau to respond intelligently :-) Terry T. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Cox regression model for matched data with replacement
On 08/13/2014 08:38 AM, John Pura wrote: Thank you for the reply. However, I think I may not have clarified what my cases are. I'm studying the effect of radiation treatment (vs. none) on survival. My cases are patients who received radiation and controls are those who did not. I used a propensity score model to match cases to controls in a 1:2 fashion. However, because the matching was done with replacement, some controls were matched to more than one case. How can I go about analyzing this - would frequency weighting work? Thanks, John We went down the wrong path. When people use the word case it almost always refers to subjects who had the outcome. If I read the above correctly you have the more simple situation of subset selection. Subjects were chosen to be in the model without reference to their outcome status, with the goal of balancing treatment wrt other predictive factors. Correct? If so, my preferred modeling strategy, in order. 1. coxph(Surv(time, status) ~ treatment, data=one) Where data set one has one copy of each subject selected to be in the study. If they were nominated twice they still appear once. Optional: give each control a case weight equal to the number of times they were selected. This will better balance the data set wrt the factors. 2. Same model, with covariates. The argument about whether covariates on which you have balanced should be included in the model is as old the hills --- belt AND suspenders? --- with proponents on both sides. Meh. Unless there are too many of course. I still like the 10-20 events per covarate rule to choose the maximum number of predictors. 3. coxph(Surv(time, status) ~ treatment + strata(group), data=two) I veiw this as model 2 with paranoia. The covariate effects are so odd that we'll never model them correctly, so treat each combination as unique. The data set two needs to have each treated subject + their controls in a separate stratum. Even though some controls are in the data set twice, they don't need case weights since they are in any given stratum only once. For any of the above you can add a robust variance. Required if case weights are used. Terry T __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Survival Analysis with an Historical Control
You are asking for a one sample test. Using your own data: connection - textConnection( GD2 1 8 12 GD2 3 -12 10 GD2 6 -52 7 GD2 7 28 10 GD2 8 44 6 GD2 10 14 8 GD2 12 3 8 GD2 14 -52 9 GD2 15 35 11 GD2 18 6 13 GD2 20 12 7 GD2 23 -7 13 GD2 24 -52 9 GD2 26 -52 12 GD2 28 36 13 GD2 31 -52 8 GD2 33 9 10 GD2 34 -11 16 GD2 36 -52 6 GD2 39 15 14 GD2 40 13 13 GD2 42 21 13 GD2 44 -24 16 GD2 46 -52 13 GD2 48 28 9 GD2 2 15 9 GD2 4 -44 10 GD2 5 -2 12 GD2 9 8 7 GD2 11 12 7 GD2 13 -52 7 GD2 16 21 7 GD2 17 19 11 GD2 19 6 16 GD2 21 10 16 GD2 22 -15 6 GD2 25 4 15 GD2 27 -9 9 GD2 29 27 10 GD2 30 1 17 GD2 32 12 8 GD2 35 20 8 GD2 37 -32 8 GD2 38 15 8 GD2 41 5 14 GD2 43 35 13 GD2 45 28 9 GD2 47 6 15 ) hsv - data.frame(scan(connection, list(vac=, pat=0, wks=0, x=0))) hsv - transform(hsv, status= (wks 0), wks = abs(wks)) fit1 - survreg(Surv(wks, status) ~ 1, data=hsv, dist='exponential') temp - predict(fit1, type='quantile', p=.5, se=TRUE) c(median= temp$fit[1], std= temp$se[1]) medianstd 24.32723 4.36930 -- The predict function gives the predicted median survival and standard deviation for each observation in the data set. Since this was a mean only model all n of them are the same and I printed only the first. For prediction they make the assumption that the std error for my future study will be the same as the std from this one, you want the future 95% CI to not include the value of 16, so the future mean will need to be at least 16 + 1.96* 4.369. A nonparmetric version of the argument would be fit2 - survfit(Surv(wks, status) ~ 1, data=hsv) print(fit2) records n.max n.start events median 0.95LCL 0.95UCL 48 48 48 31 21 15 35 Then make the argument that in our future study, the 95% CI will stretch 6 units to the left of the median, just like it did here. This argument is a bit more tenuous though. The exponential CI width depends on the total number of events and total follow-up time, and we can guess that the new study will be similar. The Kaplan-Meier CI also depends on the spacing of the deaths, which is less likely to replicate. Notes: 1. Use summary(fit2)$table to extract the CI values. In R the print functions don't allow you to grab what was printed, summary normally does. 2. For the exponential we could work out the formula in closed form -- a good homework exercise for grad students perhaps but not an exciting way to spend my own afternoon. An advantage of the above approach is that we can easily use a more realistic model like the weibull. 3. I've never liked extracting out the Surv(t,s) part of a formula as a separate statement on another line. If I ever need to read this code again, or even just the printout from the run, keeping it all together gives much better documentation. 4. Future calculations for survival data, of any form, are always tenuous since they depend critically on the total number of events that will be in the future study. We can legislate the total enrollment and follow-up time for that future study, but the number of events is never better than a guess. Paraphrasing a motto found on the door of a well respected investigator I worked with 30 years ago (because I don't remember it exaclty): The incidence of the condition under consideration and its subsequent death rate will both drop by 1/2 at the commencement of a study, and will not return to their former values until the study finishes or the PI retires. Terry T. --- On 07/10/2014 05:00 AM, r-help-requ...@r-project.org wrote: Hello All, I'm trying to figure out how to perform a survival analysis with an historical control. I've spent some time looking online and in my boooks but haven't found much showing how to do this. Was wondering if there is a R package that can do it, or if there are resources somewhere that show the actual steps one takes, or if some knowledgeable person might be willing to share some code. Here is a statement that describes the sort of analyis I'm being asked to do. A one-sample parametric test assuming an exponential form of survival was used to test the hypothesis that the treatment produces a median PFS no greater than the historical control PFS of 16 weeks. A sample median PFS greater than 20.57 weeks would fall beyond the critical value associated with the null hypothesis, and would be considered statistically significant at alpha = .05, 1 tailed. My understanding is that the cutoff of 20.57 weeks was obtained using an online calculator that can be found at: http://www.swogstat.org/stat/public/one_survival.htm Thus far, I've been unable to determine what values were plugged into the calculator to get the cutoff. There's another calculator for a nonparamertric test that can be found at:
Re: [R] Predictions from coxph or cph objects
I've been off on vacation for a few days and so am arriving late to this discussion. Try ?print.survfit, and look at the print.rmean option and the discussion thereof in the Details section of the page. It will answer your question, in more detail than you asked. The option applies to survival curves from coxph fits as well. Short summary 1. For any positive random variable X, mean(x) = integral from 0 to inf of the survival curve. For some reason I found this particular homework problem impossible, back when I was a new grad student, consequently I remember it well. 2. When there is censoring the survival curve never drops to zero, which makes the full integral not evaluable. There are well known responses to this problem, but the mean survival is used by so few that these well known approaches are only known by a small few. Enough people got confused by the resulting truncated mean that I set the default option for print.rmean to don't print it. The downside is that the few who do want a mean (like you) often have trouble discovering how to obtain it. A reference manual (with index) for the survival package is sorely needed. Someday... Terry Therneau Dear R users, My apologies for the simple question, as I'm starting to learn the concepts behind the Cox PH model. I was just experimenting with the survival and rms packages for this. I'm simply trying to obtain the expected survival time (as opposed to the probability of survival at a given time t). I can't seem to find an option from the type argument in the predict methods from coxph{survival} or cph{rms} that will give me expected survival times. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] standard error of survfit.coxph()
1. The computations behind the scenes produce the variance of the cumulative hazard. This is true for both an ordinary Kaplan-Meier and a Cox model. Transformations to other scales are done using simple Taylor series. H = cumulative hazard = log(S); S=survival var(H) = var(log(S)) = the starting point S = exp(log(S)), so var(S) is approx [deriv of exp(x)]^2 * var(log(S)) = S^2 var(H) var(log(log(S)) is approx (1/S^2) var(H) 2. At the time it was written, summary.survfit was used only for printing out the survival curve at selected times, and the audience for the printout wanted std(S). True, that was 20 years ago, but I don't recall anyone ever asking for summary to do anything else. Your request is not a bad idea. Note however that the primary impact of using log(S) or S or log(log(S)) scale is is on the confidence intervals, and they do appear per request in the summary output. Terry T. On 06/28/2014 05:00 AM, r-help-requ...@r-project.org wrote: Message: 9 Date: Fri, 27 Jun 2014 12:39:29 -0700 From: array chiparrayprof...@yahoo.com To:r-help@r-project.org r-help@r-project.org Subject: [R] standard error of survfit.coxph() Message-ID: 1403897969.91269.yahoomail...@web122906.mail.ne1.yahoo.com Content-Type: text/plain Hi, can anyone help me to understand the standard errors printed in the output of survfit.coxph()? time-sample(1:15,100,replace=T) status-as.numeric(runif(100,0,1)0.2) x-rnorm(100,10,2) fit-coxph(Surv(time,status)~x) ??? ### method 1 survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], conf.type='log')$std.err [,1]??? [,2]??? [,3]??? [,4]?? [,5] ?[1,] 0.0 0.0 0.0 0.0 0. ?[2,] 0.008627644 0.008567253 0.008773699 0.009354788 0.01481819 ?[3,] 0.008627644 0.008567253 0.008773699 0.009354788 0.01481819 ?[4,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371 ?[5,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371 ?[6,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371 ?[7,] 0.030226811 0.030423883 0.029806263 0.028918817 0.05191161 ?[8,] 0.030226811 0.030423883 0.029806263 0.028918817 0.05191161 ?[9,] 0.036852571 0.037159980 0.036186931 0.034645002 0.06485394 [10,] 0.044181716 0.044621159 0.043221145 0.040872939 0.07931028 [11,] 0.044181716 0.044621159 0.043221145 0.040872939 0.07931028 [12,] 0.055452631 0.056018832 0.054236881 0.051586391 0.10800413 [13,] 0.070665160 0.071363749 0.069208056 0.066655730 0.14976433 [14,] 0.124140400 0.125564637 0.121281571 0.118002021 0.30971860 [15,] 0.173132357 0.175309455 0.168821266 0.164860523 0.46393111 survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], conf.type='log')$time ?[1]? 1? 2? 3? 4? 5? 6? 7? 8? 9 10 11 12 13 14 15 ??? ### method 2 summary(survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], conf.type='log'),time=10)$std.err ? 1? 2? 3? 4? 5 [1,] 0.04061384 0.04106186 0.03963184 0.03715246 0.06867532 By reading the help of ?survfit.object and ?summary.survfit, the standard error provided in the output of method 1 (survfit()) was for cumulative hazard-log(survival), while the standard error provided in the output of method 2 (summary.survfit()) was for survival itself, regardless of how you choose the value for conf.type ('log', 'log-log' or 'plain'). This explains why the standard error output is different between method 1 (10th row) and method 2. My question is how do I get standard error estimates for log(-log(survival))? Thanks! John __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.