[R] X11 font

2023-08-16 Thread Therneau, Terry M., Ph.D. via R-help
I get the following error out of R, on a newer Ubuntu installation.

Error in `axis()`:
! X11 font -adobe-helvetica-%s-%s-*-*-%d-*-*-*-*-*-*-*, face 1 at size 12 could not be loaded
Backtrace:
  1. graphics::matplot(...)
  3. graphics::plot.default(...)
  4. graphics (local) localAxis(...)
  6. graphics:::Axis.default(...)
  7. graphics::axis(side = side, at = at, labels = labels, ...)

The context is pretty simple:

library(survival)
matplot(60:100, 36525* survexp.us[61:101, 1:2, "2014"], col=2:1, lty=1, lwd=2,
     xlab="Age", ylab="Death rate per 100", log='y', type='l', yaxt='n',
     main="US Death Rates")
axis(2, c(1,2,5,10,20, 50), c(1,2,5,10, 20, 50), las=2)

This code works fine on my screen.   The error comes up when I put it into an .Rmd file
and apply rmarkdown::render() to it.

Likely some font file is needed, but which one?

Terry

Version:
R Under development (unstable) (2023-08-01 r84818) -- "Unsuffered Consequences"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: aarch64-unknown-linux-gnu

-- 
Terry M Therneau, PhD
Department of Quantitative Health Sciences
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] extract from a list of lists

2022-12-27 Thread Therneau, Terry M., Ph.D. via R-help
Thanks everyone for prodding my gray matter, which seems to be getting stiffer as I
approach 70 (< 90 days).

  --  I'll continue to use the $ or [[ forms.   That will suffice.

  --  I thought there might be a base R variant, e.g. something like extract(list,
element-name); probably cross talk in my brain from the rstan library.

  --  Greg's note shows such a function in purrr.  But I rather like having as few
dependencies as possible in a package, and one usage is normally not enough to justify a
dependency, at least for something this simple.

Terry T.
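
For the record, there is a base R shorthand: sapply() and friends accept "[[" as the
function, with the element name passed as an extra argument.  A minimal sketch, using an
invented fits list of the shape described in the post:

fits <- list(list(beta = 1, loglik = -5, iter = 3),
             list(beta = 2, loglik = -4, iter = 5),
             list(beta = 3, loglik = -6, iter = 4))

sapply(fits, function(x) x$iter)        # the explicit form from the post
sapply(fits, "[[", "iter")              # base R shorthand: [[ as the function
vapply(fits, "[[", numeric(1), "iter")  # type-safe variant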

On 12/27/22 14:38, Bert Gunter wrote:
> Well, I prefer Greg's approach, but if you want to avoid calls to $ or
> `[[` then you could do:
>
> unlist(fits)[ rep(names(fits[[1]]) == 'iter', length(fits))]
>
>
> Cheers,
> Bert
>
> On Tue, Dec 27, 2022 at 9:46 AM Greg Snow<538...@gmail.com>  wrote:
>> Another option is the map family of functions in the purrr package
>> (yes, this depends on another package being loaded, which may affect
>> things if you are including this in your own package, creating a
>> dependency).
>>
>> In map and friends, if the "function" is a string or integer, then it
>> is taken as the piece to be extracted, so you should be able to do
>> something like:
>>
>> library(purrr)
>> map(fits, 'iter')
>> # or
>> map_int(fits, 'iter')
>> # or
>> map_dbl(fits, 'iter')
>>
>> which of the last 2 to use depends on how `iter` is stored.
>>
>> On Tue, Dec 27, 2022 at 10:16 AM Therneau, Terry M., Ph.D. via R-help
>>   wrote:
>>> I not uncommonly have the following paradigm
>>>  fits <- lapply(argument, function)
>>>
>>> resulting in a list of function results.   Often, the outer call is to mclapply, and the
>>> function encodes some long calculation, e.g. multiple chains in an MCMC.
>>> Assume for illustration that each function returns a list with elements beta, loglik, iter.
>>>
>>> Then  sapply(fits,  function(x) x$iter)
>>> will give me a vector, with the number of iterations used by each instance.
>>>
>>> I've often been suspicious that there is some simple shorthand for the "grab all the
>>> elements named iter" that skips the explicit x$iter function.   Am I indeed overlooking
>>> something?    I don't expect a speed increase, just cleaner code.
>>>
>>> Terry T.
>>>
>>> --
>>> Terry M Therneau, PhD
>>> Department of Quantitative Health Sciences
>>> Mayo Clinic
>>> thern...@mayo.edu
>>>
>>> "TERR-ree THUR-noh"
>>>
>>>  [[alternative HTML version deleted]]
>>>
>>> __
>>> R-help@r-project.org  mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> 538...@gmail.com
>>
>> __
>> R-help@r-project.org  mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] MCMC tools

2022-12-27 Thread Therneau, Terry M., Ph.D. via R-help
Is there a convenient package that computes standard convergence summaries for an MCMC
run?    This is something that I likely knew once and have now forgotten.

  More detail:  I'm trying to understand the MCMC done by a particular model called
Subtype and Stage Inference (SuStaIn); suffice it to say that I am cautious of some
details.    The model doesn't fit into the standard packages, so I've set it up and run
my own Metropolis chains.   I don't want to expend energy also creating R-hat, ESS, and
other sensible summaries; even more so to find the inevitable programming mistakes I'll
make if I create them myself.
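
One widely used option (a suggestion added here, not named in the original post) is the
coda package.  A minimal sketch, where chain_list is a hypothetical list with one matrix
of Metropolis draws per chain, parameters in columns:

library(coda)
chains <- mcmc.list(lapply(chain_list, mcmc))  # chain_list: your own draws (hypothetical)
effectiveSize(chains)   # effective sample size (ESS) per parameter
gelman.diag(chains)     # Gelman-Rubin R-hat; requires 2 or more chains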

  For the truly curious:   I have measurements of tau deposition for 86 brain regions
(43 * left/right) of interest from 1925 scans. One hypothesis in dementia research is
that tau deposition occurs over time, in a pattern; there is likely more than one
pattern.    The algorithm is looking at permutations of the 86 regions, in search of a
small collection that best fits all the subjects.    There is a general background of
statistical work showing that ranking is a hard problem, in the sense of obtaining a
small variance for individual ranks.   SuStaIn tends to give small variances.

Terry T.

-- 
Terry M Therneau, PhD
Department of Quantitative Health Sciences
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] extract from a list of lists

2022-12-27 Thread Therneau, Terry M., Ph.D. via R-help
I not uncommonly have the following paradigm
    fits <- lapply(argument, function)

resulting in a list of function results.   Often, the outer call is to mclapply, and the
function encodes some long calculation, e.g. multiple chains in an MCMC.
Assume for illustration that each function returns a list with elements beta, loglik, iter.

Then  sapply(fits,  function(x) x$iter)
will give me a vector, with the number of iterations used by each instance.

I've often been suspicious that there is some simple shorthand for the "grab all the
elements named iter" that skips the explicit x$iter function.   Am I indeed overlooking
something?    I don't expect a speed increase, just cleaner code.

Terry T.

-- 
Terry M Therneau, PhD
Department of Quantitative Health Sciences
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to find "first" or "last" record after sort in R

2021-09-10 Thread Therneau, Terry M., Ph.D. via R-help
I prefer the duplicated() function, since the final code will be clear to a future reader. 
 (Particularly when I am that future reader).


last <- !duplicated(mydata$ID, fromLast=TRUE)  # point to the last row for each ID
mydata$date3[last] <- NA
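
A runnable sketch of the same idea, with a toy data frame shaped like the one in the
question below:

mydata <- data.frame(ID    = c(1, 2, 3, 3, 4, 5),
                     date1 = as.Date(c("2015-10-08", "2016-01-16", "2016-08-01",
                                       "2017-01-10", "2016-01-19", "2016-03-01")),
                     date3 = as.Date(c("2015-07-23", "2015-10-08", "2017-01-10",
                                       "2016-01-16", "2016-08-01", "2016-01-19")))
last <- !duplicated(mydata$ID, fromLast = TRUE)
mydata$date3[last] <- NA
mydata   # date3 is now NA in the last row of each ID, including the one-row IDs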

Terry T.

(I read the list once a day in digest form, so am always a late reply.)

On 9/10/21 5:00 AM, r-help-requ...@r-project.org wrote:

Hello List,
Please look at the sample data frame below:

ID    date1         date2         date3
1     2015-10-08    2015-12-17    2015-07-23
2     2016-01-16    NA            2015-10-08
3     2016-08-01    NA            2017-01-10
3     2017-01-10    NA            2016-01-16
4     2016-01-19    2016-02-24    2016-08-01
5     2016-03-01    2016-03-10    2016-01-19
This data frame was sorted by ID and date1. I need to set the column date3 to missing for
the "last" record for each ID. In the sample data set, IDs 1, 2, 4 and 5 have one row
only, so each can be considered as both the first and last record, and date3 can be set
to missing. But ID 3 has 2 rows. Since I sorted the data by ID and date1, the row with
ID=3 and date1=2017-01-10 is the last record; I need to set date3=NA for this row only.

The question is: how can I identify the "last" record and set date3 to NA for it?
Thank you,
Kai
[[alternative HTML version deleted]]



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxph means not equal to means of model matrix

2021-09-03 Thread Therneau, Terry M., Ph.D. via R-help


On 9/3/21 12:59 PM, Bond, Stephen wrote:
>
> I looked at the nocenter option and it says (-1,0,1) values, but it seems that any
> three-level factor is included in that (represented as 1,2,3 in R).
>
A factor is turned into a set of 0/1 dummy variables, so the nocenter applies.  I will
add more clarification to the documentation.

> Also, is the baseline curve now showing the reference level and not the fictional .428
> sex? If I predict the risk for a new row, should I multiply the coefficient shown in
> the output by 1 for a sex=1? It used to be (1-.428)*coef.
>
Yes, the "mean" component is the reference level for predict and survfit.� If I 
could go 
back in time it would be labeled as "reference" instead of "mean".�� Another 
opportunity 
for me to make the documentation clearer.
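
A small sketch of the practical consequence, rebuilding the test1 fit from the quoted
question below (illustrative only):

library(survival)
test1 <- data.frame(time   = c(4, 3, 1, 1, 2, 2, 3),
                    status = c(1, 1, 1, 0, 1, 1, 0),
                    x      = c(0, 2, 1, 1, 1, 0, 0),
                    sex    = factor(c(0, 0, 0, 0, 1, 1, 1)))
m1 <- coxph(Surv(time, status) ~ x + sex, test1)

m1$means              # the sex1 entry is 0: the reference level
# the relative risk for a new row with sex=1 is now exp(1 * coef), not
# exp((1 - .428) * coef) as under the old centering
exp(coef(m1)["sex1"])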

Good questions,
   Terry T.

> Thanks for clarifying.
>
> SB
>
> *From:* Therneau, Terry M., Ph.D. 
> *Sent:* Friday, 3 September, 2021 12:37
> *To:* Bond, Stephen 
> *Cc:* R-help 
> *Subject:* Re: coxph means not equal to means of model matrix
>
> [EXTERNAL]
>
> --
>
> See ?coxph, in particular the new "nocenter" option.
>
> Basically, the "mean" component is used to center later computations.� This 
> can be 
> critical for continuous variables, avoiding overflow in the exp function, but 
> is not 
> necessary for 0/1 covariates.�� The fact that the default survival curve 
> would be for a 
> sex of .453, say, was off-putting to many.
>
> Terry T.
>
> On 9/3/21 11:01 AM, Bond, Stephen wrote:
>
> Hi,
>
> Please, help me understand what is happening with the means of a Cox 
> model?
>
> I have:
>
> R version 4.0.2 (2020-06-22) -- "Taking Off Again"
>
> Copyright (C) 2020 The R Foundation for Statistical Computing
>
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> getOption("contrasts")
>
>         unordered           ordered
> "contr.treatment"      "contr.poly"
>
> According to the help, coxph.object has a component holding the means of the X
> (model.matrix). This does not hold any more.
>
> ```
>
> library(survival)
>
> test1 <- list(time=c(4,3,1,1,2,2,3),
>               status=c(1,1,1,0,1,1,0),
>               x=c(0,2,1,1,1,0,0),
>               sex=factor(c(0,0,0,0,1,1,1)))
>
> m1 <- coxph(Surv(time, status) ~ x + sex, test1)
>
> m1$means
> ##         x      sex1
> ## 0.7142857 0.0000000
>
> colMeans(model.matrix(m1))
> ##         x      sex1
> ## 0.7142857 0.4285714
>
> ```
>
> Will new observations be scored using the zero mean from the object?  Is this just a
> reporting change where $means shows the reference level and no longer the mean of
> the model matrix?
>
> Thanks everybody
>


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxph means not equal to means of model matrix

2021-09-03 Thread Therneau, Terry M., Ph.D. via R-help
See ?coxph, in particular the new "nocenter" option.

Basically, the "mean" component is used to center later computations.� This can 
be 
critical for continuous variables, avoiding overflow in the exp function, but 
is not 
necessary for 0/1 covariates.�� The fact that the default survival curve would 
be for a 
sex of .453, say, was off-putting to many.
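
A short sketch of the behavior, using the test1 data from the quoted question below:

library(survival)
test1 <- data.frame(time   = c(4, 3, 1, 1, 2, 2, 3),
                    status = c(1, 1, 1, 0, 1, 1, 0),
                    x      = c(0, 2, 1, 1, 1, 0, 0),
                    sex    = factor(c(0, 0, 0, 0, 1, 1, 1)))
fit <- coxph(Surv(time, status) ~ x + sex, test1)

fit$means                     # x centered at its mean; the 0/1 dummy sex1 stored as 0
colMeans(model.matrix(fit))   # the true column means, for comparison
sfit <- survfit(fit)          # the baseline curve is at the reference level sex=0,
                              # not at a fictional sex of 0.43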

Terry T.


On 9/3/21 11:01 AM, Bond, Stephen wrote:
>
> Hi,
>
> Please, help me understand what is happening with the means of a Cox model?
>
> I have:
>
> R version 4.0.2 (2020-06-22) -- "Taking Off Again"
>
> Copyright (C) 2020 The R Foundation for Statistical Computing
>
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> getOption("contrasts")
>
>         unordered           ordered
> "contr.treatment"      "contr.poly"
>
> According to the help, coxph.object has a component holding the means of the X
> (model.matrix). This does not hold any more.
>
> ```
>
> library(survival)
>
> test1 <- list(time=c(4,3,1,1,2,2,3),
>               status=c(1,1,1,0,1,1,0),
>               x=c(0,2,1,1,1,0,0),
>               sex=factor(c(0,0,0,0,1,1,1)))
>
> m1 <- coxph(Surv(time, status) ~ x + sex, test1)
>
> m1$means
> ##         x      sex1
> ## 0.7142857 0.0000000
>
> colMeans(model.matrix(m1))
> ##         x      sex1
> ## 0.7142857 0.4285714
>
> ```
>
> Will new observations be scored using the zero mean from the object?  Is this just a
> reporting change where $means shows the reference level and no longer the mean of the
> model matrix?
>
> Thanks everybody
>


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] svycoxph and cox.zph for testing the PH assumption

2021-07-12 Thread Therneau, Terry M., Ph.D. via R-help




On 7/11/21 5:00 AM, r-help-requ...@r-project.org wrote:

Hello, is it kosher to call cox.zph on a svycoxph model fit? I see that
someone proposed a modified version of cox.zph that uses resid(fit,
'schoenfeld', **weighted=TRUE**).

https://stats.stackexchange.com/questions/265307/assessing-proportional-hazards-assumption-of-a-cox-model-with-caseweights
Is that all it takes?
Thanks,
Youyi


The cox.zph function does a formal score test.  No, it does not account for the robust
variance.  I hadn't considered that case, but will now think about it.  It is quite easy
to show that there is a problem: just give everyone a weight of 100, as in the sketch below.
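
A sketch of that check, using the lung data as a stand-in:

library(survival)
fit1 <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)
fit2 <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung,
              weights = rep(100, nrow(lung)))
cox.zph(fit1)
cox.zph(fit2)   # identical coefficients, but the score test behaves as though
                # n were 100 times larger, so the p-values collapse toward 0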


The stackexchange conversation was new to me.  The solution there won't work with the
current code, which does not make use of resid().  The code has been updated to do the
proper score test; the older version of cox.zph, which they modified, used an approximation.


Terry T.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] inverse of the methods function

2021-05-03 Thread Therneau, Terry M., Ph.D. via R-help
Is there a complement to the methods function, one that will list all the defined methods
for a class?    One solution is to look directly at the NAMESPACE file of the package
that defines the class and parse out the entries.   I was looking for something built-in,
i.e., easier.
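
For what it is worth, the built-in methods() function does accept a class= argument that
answers this for S3 classes; a short sketch:

library(survival)
methods(class = "coxph")   # every registered S3 method for objects of class coxph
# for S4 generics the analogous query is showMethods(classes = "someClass")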


-- 
Terry M Therneau, PhD
Department of Health Science Research
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] evil attributes

2021-04-11 Thread Therneau, Terry M., Ph.D. via R-help
I wrote: "I confess to being puzzled WHY the R core has decided on this 
definition..."
After just a little more thought let me answer my own question.

a. The as.vector() function is designed to strip off everything extraneous and leave just
the core.   (I have a mental image of Jack Webb saying "Just the facts, ma'am".)   I
myself use it frequently in the test suite for survival, in cases where I'm checking for
the correct numeric result and don't care about any attached names.

  b. is.vector(x) essentially answers the question "does x look like a result of as.vector?"

Nevertheless I understand Roger's confusion.
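
A short illustration of points (a) and (b):

x <- c(a = 1, b = 2)
is.vector(x)              # TRUE: names are the one attribute that is.vector allows
as.vector(x)              # strips the names -- "just the facts"
attr(x, "units") <- "cm"  # any other attribute disqualifies the object
is.vector(x)              # FALSE
is.vector(as.vector(x))   # TRUE again once everything extraneous is removed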

-- 
Terry M Therneau, PhD
Department of Quantitative Health Sciences
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "exact" p-values

2021-03-20 Thread Therneau, Terry M., Ph.D. via R-help
I am late to this discussion -- I read R-help as a once-a-day summary.  A few comments.

1. In the gene-discovery subfield of statistics (SNP studies, etc.) there is a huge
multiple-testing problem.  In defense, the field thinks in terms of thresholds like 1e-5
or 1e-10 rather than the .05 or .01 most of us are used to.   In that literature, they do
care about 1e-16 vs 1e-20.    We can all argue about whether that is a sensible approach
or not, but it is what it is.  I think that this is the context of the journal's request,
i.e., they want the actual number, however you calculate it.

My own opinion is that these rarefied p-values are an arbitrary scale, one that no longer
has a probability interpretation.   For the central limit theorem to be correct that far
from the mean requires a sample size that is beyond imagination  (`number of atoms in the
earth' order of size).   Such a scale may still be useful, but it's not really a probability.

2. The label of "Fisher's exact test" has caused decades of confusion.  In this context
the word means "a particular test whose distribution can be completely enumerated": it
does not mean either "correct" or "precise".  The original enumeration methods had
limitations with respect to the sample size or the presence of complications such as tied
values; from the discussion so far it would appear that the 'exact' argument of
wilcox.test uses such a method.   Cyrus Mehta did nice work on improved algorithms that
do not have these restrictions, methods that have been refined and expanded in the
software offerings from Cytel among others. Perhaps someone could update R's code to use
this, but see 3 below.

My own opinion is that permutation tests are an important tool, one "wrench" in our
statistical toolbox.   But they are only one tool out of many.  I am quite put off by
arguments that purposefully conflate "exact" and "correct".

3. The concordance statistic C, the Wilcoxon test, and Somers' d are all the same
statistic, just written a little differently.  (Somers' d is essentially Kendall's tau,
but with a slightly different rule for ties.)  A test for C=.5 is the same as a Wilcoxon.
For a binary response, C = the area under the receiver operating curve (AUC).   The
concordance command in the survival library computes this statistic for continuous,
binary, or censored responses.    The variance is based on a jackknife argument, and is
computed by organizing the data into a binary tree structure, very similar to the methods
used by Mehta; it is efficient for large n and is valid for ties.   Perhaps add a link in
the wilcox.test help page?
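
A minimal sketch of that last point (component names are those of the current
concordance object; check ?concordance):

library(survival)
cfit <- concordance(Surv(time, status) ~ age, data = lung)
cfit                                       # C with a jackknife-based standard error
(cfit$concordance - 0.5)/sqrt(cfit$var)    # approximate z test of C = 0.5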

Footnote: AUC is a special case of C but not vice versa.  People sometimes try to extend
AUC to the other data types, but IMHO with only moderate success.

-- 
Terry M Therneau, PhD
Department of Health Science Research
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] temporary clipping

2020-10-26 Thread Therneau, Terry M., Ph.D. via R-help
In one of my plot functions it is convenient to use clipping to restrict the range of
some output.
But at the end of the function I'd like to turn it off, i.e., so that a subsequent use of
legend(), text() or whatever is not affected.
I don't quite see how to do this -- it seems that the only way to turn off clip is with a
new plot.
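
One workaround sketch: save par("usr") and call clip() again to restore the full region.

plot(1:10, 1:10, type = "n")
usr <- par("usr")                    # x1, x2, y1, y2 of the full user region
clip(2, 8, 2, 8)                     # temporary clipping (illustrative limits)
abline(0, 1, lwd = 8, col = "grey")  # drawn only inside the clip rectangle
do.call(clip, as.list(usr))          # reset, so later legend()/text() are unaffected
legend("topleft", legend = "not clipped")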

-- 
Terry M Therneau, PhD
Department of Health Science Research
Mayo Clinic
thern...@mayo.edu

"TERR-ree THUR-noh"


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] "chi-square" | "chi-squared" | "chi squared" | "chi, square"

2019-10-19 Thread Therneau, Terry M., Ph.D. via R-help
Martin,
   A fun question.

Looking back at my oldest books, Feller (1950) used chi-square.
Then I walked down the hall to our little statistics library and looked at Johnson and
Kotz, "Continuous Univariate Distributions", since each chapter therein has comments
about the history of the distribution.

  a.  They use 'chi-square' throughout their history section, tracing the distribution
back to work in the 1800s.  But those earliest papers apparently didn't name their
results as chi-whatever, so an "origin" story didn't pan out.

  b. They have 13 pages of references, and for fun I counted the occurrence of variants.
The majority of papers don't have the word in the title at all, and the next most common
is the Greek symbol. Here are the years of the others:

chi-square:   73 43 65 80 86 73 82 73 69 69 78 64 64 86 65 86 82 82 76 82 88 81 74 77 87 86 93 69 60 88 88 80 77 41 59 79 31
chi-squared: 72 76 82 83 89 79 69 67 77 78 69 77 83 88 87 89 78
chi:  92 73 89 87
chi-squares: 77 83
chi-bar-square: 91

There doesn't look to be a trend over time.  The 1922 Fisher reference uses the Greek
symbol, by the way.

Terry T


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] conflicting results for a time-varying coefficient in a Cox model

2019-08-08 Thread Therneau, Terry M., Ph.D. via R-help
This is an excellent question.
The answer, in this particular case, mostly has to do with the outlier time values.
(I've never been convinced that the death at time 999 isn't really a misplaced code for
"missing", actually.)    If you change the knots used by the spline you can get quite
different values.
For instance, using a smaller data set:

library(splines)   # for ns(), used below
fit1 <- coxph(Surv(time, status) ~ trt + prior + karno, veteran)
zph1 <- cox.zph(fit1, transform='identity')
plot(zph1[3])

dtime <- unique(veteran$time[veteran$status ==1])    # all of the death times
veteran2 <- survSplit( Surv(time, status) ~ ., data=veteran, cut=dtime)
fit2 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno +
     karno:ns(time, df=4),  data=veteran2)
tx <- 0:100 * 10    # x positions for plot
ncall <- attr(terms(fit2), "predvars")[[6]]
ty <- eval(ncall, data.frame(time = tx)) %*% coef(fit2)[4:7] + coef(fit2)[3]
lines(tx, ty, col=2)

-

Now it looks even worse!  The only difference is that the ns() function has chosen a
different set of knots.

The test used by the cox.zph function is based on a score test and is solid.  The plot
that it produces uses a smoothed approximation to the variance matrix and is approximate.
So the diagnostic plot will never exactly match an actual fit.   In this data set the
outliers exacerbate the issue.  To see this, try a different time scale.


zph2 <- cox.zph(fit1, transform= sqrt)
plot(zph2[3])
veteran2$stime <- sqrt(veteran2$time)
fit3 <- coxph(Surv(tstart, time, status) ~ trt + prior + karno +
    karno:ns(stime, df=4),  data=veteran2)

ncall3 <- attr(terms(fit3), "predvars")[[6]]
ty3 <- eval(ncall3, data.frame(stime= sqrt(tx))) %*% coef(fit3)[4:7] + coef(fit3)[3]
lines(sqrt(tx), ty3, col=2)



The right tail is now better behaved.   Eliminating the points >900 makes things even
better behaved.

Terry T.




On 8/8/19 9:07 AM, Ferenci Tamas wrote:
> I was thinking of two possible ways to
> plot a time-varying coefficient in a Cox model.
>
> One is simply to use survival::plot.cox.zph which directly produces a
> beta(t) vs t diagram.
>
> The other is to transform the dataset to counting process format and
> manually include an interaction with time, expanded with spline (to be
> similar to plot.cox.zph). Plotting the coefficient produces the needed
> beta(t) vs t diagram.
>
> I understand that they're slightly different approaches, so I don't
> expect totally identical results, but nevertheless, they approximate
> the very same thing, so I do expect that the results are more or less
> similar.
>
> However:
>
> library( survival )
> library( splines )
>
> data( veteran )
>
> zp <- cox.zph( coxph(Surv(time, status) ~ trt + prior + karno,
>   data = veteran ), transform = "identity" )[ 3 ]
>
> veteran3 <- survSplit( Surv(time, status) ~ trt + prior + karno,
> data = veteran, cut = 1:max(veteran$time) )
>
> fit <- coxph(Surv(tstart,time, status) ~ trt + prior + karno +
> karno:ns( time, df = 4 ), data = veteran3 )
> cf <- coef( fit )
> nsvet <- ns( veteran3$time, df = 4 )
>
> plot( zp )
> lines( 0:1000, ns( 0:1000, df = 4, knots = attr( nsvet, "knots" ),
> Boundary.knots = attr( nsvet, "Boundary.knots" ) )%*%cf[
>   grep( "karno:ns", names( cf ) ) ] + cf["karno"],
> type = "l", col = "red" )
>
> Where is the mistake? Something must be going on here, because the
> plots are vastly different...
>
> Thank you very much in advance,
> Tamas


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Trying to understand the magic of lm (Still trying)

2019-05-13 Thread Therneau, Terry M., Ph.D. via R-help
John,

  The text below is cut out of a "how to write a package" course I gave at the R
conference in Vanderbilt.   I need to find a home for the course notes, because it had a
lot of tidbits that are not well explained in the R documentation.
Terry T.



Model frames:
One of the first tasks of any modeling routine is to construct a special data frame
containing the covariates that will be used, via a call to the model.frame function. The
code to do this is found in many routines, and can be a little opaque on first view. The
obvious code would be
\begin{verbatim}
coxph <- function(formula, data, weights, subset, na.action,
     init, control, ties= c("efron", "breslow", "exact"),
     singular.ok =TRUE, robust=FALSE,
     model=FALSE, x=FALSE, y=TRUE,  tt, method=ties, ...) {

  mf <- model.frame(formula, data, subset, weights, na.action)
\end{verbatim}
since those are the coxph arguments that are passed forward to the model.frame routine.
However, this simple approach will fail with a ``not found'' error message if any of the
data, subset, weights, etc. arguments are missing. Programs have to take the slightly
more complicated approach of constructing a call.
\begin{verbatim}
Call <- match.call()
indx <- match(c("formula", "data", "weights", "subset", "na.action"),
   names(Call), nomatch=0)
if (indx[1] ==0) stop("A formula argument is required")
temp <- Call[c(1,indx)]  # only keep the arguments we wanted
temp[[1]] <- as.name('model.frame')  # change the function called
mf <- eval(temp, parent.frame())

Y <- model.response(mf)
etc.
\end{verbatim}

We start with a copy of the call to the program, which we want to save anyway as
documentation in the output object. Then subscripting is used to extract only the
portions of the call that we want, saving the result in a temporary. This is based on the
fact that a call object can be viewed as a list whose first element is the name of the
function to call, followed by the arguments to the call. Note the use of
\code{nomatch=0}; if any arguments on the list are missing they will then be missing in
\code{temp}, without generating an error message. The \code{temp} variable will contain
an object of type ``call'', which is an unevaluated call to a routine.  Finally, the name
of the function to be called is changed from ``coxph'' to ``model.frame'' and the call is
evaluated.  In many of the core routines the result is stored in a variable ``m''.  This
is a horribly short and non-descriptive name. (The above used mf, which isn't much
better.)  Many routines also use ``m'' for the temporary variable, leading to
\code{m <- eval(m, parent.frame())}, but I think that is unnecessarily confusing.
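
A self-contained toy version of the pattern (quote() is used here in place of as.name;
both work):

myfit <- function(formula, data, weights, subset, na.action) {
    Call <- match.call()
    indx <- match(c("formula", "data", "weights", "subset", "na.action"),
                  names(Call), nomatch = 0)
    if (indx[1] == 0) stop("a formula argument is required")
    temp <- Call[c(1, indx)]                # keep only the arguments we want
    temp[[1]] <- quote(stats::model.frame)  # change the function to be called
    mf <- eval(temp, parent.frame())
    model.response(mf)                      # return the response, just as a demo
}

myfit(mpg ~ wt, data = mtcars, subset = cyl == 4)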

The list of names in the match call will include all arguments that should be evaluated
within the context of the named dataframe. This can include more than the list above; the
survfit routine for instance has an optional argument ``id'' that names an identifying
variable (several rows of the data may represent a single subject), and this is included
along with ``formula'' etc. in the list of choices in the match function.  The order of
names in the list makes no difference.  The id is later retrieved with
\code{model.extract(m, 'id')}, which will be NULL if the argument was not supplied. At
the time that coxph was written I had not caught on to this fact, and thought that all
variables that came from a data frame had to be represented in the formula somehow; thus
the use of \code{cluster(id)} as part of the formula, in order to denote a grouping
variable.

On 5/11/19 5:00 AM, r-help-requ...@r-project.org wrote:
> A number of people have helped me in my mission to understand how lm (and 
> other fucntions) are able to pass a dataframe and then refer to a specific 
> column in the dataframe. I thank everyone who has responded. I now know a bit 
> about deparse(substitute(xx)), but I still don't fully understand how it 
> works. The program below attempts to print a column of a dataframe from a 
> function whose parameters include the dataframe (df) and the column requested 
> (col). The program works fine until the last print statement were I receive 
> an error,  Error in `[.data.frame`(df, , col) : object 'y' not found . I hope 
> someone can explain to me (1) why my code does not work, and (2) what I can 
> do to fix it.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] adding a hex sticker to a package

2019-01-21 Thread Therneau, Terry M., Ph.D. via R-help
I've created a hex sticker for survival.  How should that be added to the package
directory?   It's temporarily in man/figures on the github page.

Terry T.

(Actually, the idea was from Ryan Lennon. I liked it, and we found someone with actual
graphical skills to execute it.)

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] weighted Fleming-Harrington log rank test

2018-02-16 Thread Therneau, Terry M., Ph.D.
You are correct that the survdiff routine only supports the 'rho' parameter of the
Fleming-Harrington G-rho tests.  This is a function of age -- I wrote the original code
back when I was working with Tom (Fleming), and he was only using 1 parameter.  Later he
and David expanded the test to two parameters.  This might be only the second request for
the feature in the 30+ years since that date.
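
For reference, a sketch of what is available -- the single rho parameter, with the
two-parameter G(rho, gamma) weights being the missing piece:

library(survival)
survdiff(Surv(time, status) ~ sex, data = lung, rho = 0)  # the usual log-rank test
survdiff(Surv(time, status) ~ sex, data = lung, rho = 1)  # Peto & Peto modification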

Terry Therneau

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Time-dependent coefficients in a Cox model with categorical variants

2018-01-18 Thread Therneau, Terry M., Ph.D.

First, as others have said, please obey the mailing list rules and turn off html; not
everyone uses an html email client.

Here is your code, formatted and with line numbers added.  I also fixed one error: "y"
should be "status".

1. fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0)
2. p <- log(predict(fit0, newdata = data1, type = "expected"))
3. lp <- predict(fit0, newdata = data1, type = "lp")
4. logbase <- p - lp
5. fit1 <- glm(status ~ offset(p), family = poisson, data = data1)
6. fit2 <- glm(status~ lp + offset(logbase), family = poisson, data = data1)
7. group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
8. fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

The key idea of the paper you referenced is that the counterpart to the Hosmer-Lemeshow
test (wrong if used directly in a Cox model) is to look at the predicted values from a
Cox model as input to a Poisson regression.  That means adding the expected from the Cox
model as a fixed term in the Poisson, and like any other Poisson fit that means
offset(log(expected)) as a term.

The presence of time dependent covariates does nothing to change this, per se, since the
expected for time-fixed is the same as for time-varying.  In practice it does matter, at
least philosophically.  Lines 1, 2, 5 do this just fine.

If data1 is not the same as data0 -- a new study, say -- then the test for intercept=0
from fit1 is a test of overall calibration.  Models like line 8 try to partition out
where any differences actually lie.

The time-dependent covariates part lies in the fact that a single subject may be
represented by multiple lines in data0 and/or data1.  Do you want to collapse that person
into a single row before the glm fits?  If subject "Jones" is represented by 15 lines in
the data and "Smith" by 2, it does seem a bit unfair to give Jones 15 observations in the
glm fit.  But full discussion of this is as much philosophy as statistics, and is perhaps
best done over a beer.

Terry T.


From: Max Shell [archerr...@gmail.com]
Sent: Wednesday, January 17, 2018 10:25 AM
To: Therneau, Terry M., Ph.D.
Subject: Re: Time-dependent coefficients in a Cox model with categorical 
variants

Assessing calibration of Cox model with time-dependent coefficients:
<https://stats.stackexchange.com/questions/323569/assessing-calibration-of-cox-model-with-time-dependent-coefficients>

I am trying to find methods for testing and visualizing calibration of Cox models with
time-dependent coefficients. I have read your nice article
<http://journals.sagepub.com/doi/10.1177/0962280213497434>. In this paper, we can fit
three models:

fit0 <- coxph(Surv(futime, status) ~ x1 + x2 + x3, data = data0)
p <- log(predict(fit0, newdata = data1, type = "expected"))
lp <- predict(fit0, newdata = data1, type = "lp")
logbase <- p - lp
fit1 <- glm(y ~ offset(p), family = poisson, data = data1)
fit2 <- glm(y ~ lp + offset(logbase), family = poisson, data = data1)
group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
fit3 <- glm(y ~ -1 + group + offset(p), family = poisson, data = data1)

Here, I simply use data1 <- data0[1:500,]

First, I get following error when running line 5.

Error in eval(predvars, data, env) : object 'y' not found

So I modified the code by replacing y with status, so it looks like this:

fit1 <- glm(status ~ offset(p), family = poisson, data = data1)
fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1)
group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

Is this replacement correct?

Second, I try to introduce the time transform, using coxph with the tt parameter.

My code is:

fit0 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + tt(x3), data = data0,
              tt = function(x, t, ...) x * t)
p <- log(predict(fit0, newdata = data1, type = "expected"))
lp <- predict(fit0, newdata = data1, type = "lp")
logbase <- p - lp
fit1 <- glm(status ~ offset(p), family = poisson, data = data1)
fit2 <- glm(status ~ lp + offset(logbase), family = poisson, data = data1)
group <- cut(lp, c(-Inf, quantile(lp, (1:9) / 10), Inf))
fit3 <- glm(status ~ -1 + group + offset(p), family = poisson, data = data1)

My questions are:

  *   Is the code above correct?
  *   How to interpret fit1, fit2, fit3? What's the connection between the three models
and the calibration of the Cox model?
  *   How to generate the calibration plot using fit3? The article does have a section
discussing this, but no code is provided.

Thank you!

On Mon, Jan 15, 2018 at 9:23 PM, Therneau, Terry M., Ph.D.
<thern...@mayo.edu>

[R] help matching rows of a data frame

2017-09-18 Thread Therneau, Terry M., Ph.D.
This question likely has a 1 line answer, I'm just not seeing it.  (2, 3, or 10 lines is
fine too.)


For a vector I can do group <- match(x, unique(x)) to get a vector that labels each
element of x.

What is an equivalent if x is a data frame?

The result does not have to be fast: the data set will have < 100 elements.  Since this
is inside the survival package, and that package is on the 'recommended' list, I can't
depend on any package outside the recommended list.
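
One short base-R sketch (the paste trick; interaction() would be another option):

df <- data.frame(a = c(1, 2, 1, 2), b = c("x", "y", "x", "z"))
key <- do.call(paste, c(df, sep = "\r"))   # "\r" as an unlikely-to-collide separator
group <- match(key, unique(key))
group   # 1 2 1 3: one label per distinct row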


Terry T.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Odd results from rpart classification tree

2017-05-15 Thread Therneau, Terry M., Ph.D.
You are mixing up two of the steps in rpart.  1: how to find the best candidate split and 
2: evaluation of that split.


With the "class" method we use the information or Gini criteria for step 1.  The code 
finds a worthwhile candidate split at 0.5 using exactly the calculations you outline.  For 
step 2 the criteria is the "decision theory" loss.  In your data the estimated rate is 0 
for the left node and 15/45 = .333 for the right node.  As a decision rule both predict 
y=0 (since both are < 1/2).  The split predicts 0 on the left and 0 on the right, so does 
nothing.


The CART book (Breiman, Friedman, Olshen and Stone), on which rpart is based, highlights
the difference between odds-regression (for which the final prediction is a percent, and
the error is Gini) and classification.  For the former, treat y as continuous, as in the
sketch below.
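
A minimal sketch of that suggestion, reusing the data from the question below:

library(rpart)
y <- c(rep(0, 65), rep(1, 15), rep(0, 20))
x <- c(rep(0, 70), rep(1, 30))
# with method = "anova" the response is treated as a rate; the two nodes predict
# different rates, so the x = 0.5 split now reduces the error and is retained
f2 <- rpart(y ~ x, method = "anova", minsplit = 2, cp = 0.0001)
printcp(f2)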


Terry T.


On 05/15/2017 05:00 AM, r-help-requ...@r-project.org wrote:

The following code produces a tree with only a root. However, clearly the tree with a
split at x=0.5 is better. rpart doesn't seem to want to produce it.

Running the following produces a tree with only root.

y <- c(rep(0,65),rep(1,15),rep(0,20))
x <- c(rep(0,70),rep(1,30))
f <- rpart(y ~ x, method='class', minsplit=1, cp=0.0001, parms=list(split='gini'))

Computing the improvement for a split at x=0.5 manually:

obs_L <- y[x < .5]
obs_R <- y[x > .5]
n   <- length(y)
n_L <- sum(x < .5)
n_R <- sum(x > .5)
gini <- function(p) {sum(p*(1-p))}
impurity_root <- gini(prop.table(table(y)))
impurity_L <- gini(prop.table(table(obs_L)))
impurity_R <- gini(prop.table(table(obs_R)))
impurity <- impurity_root * n - (n_L*impurity_L + n_R*impurity_R) # 2.880952

Thus, an improvement of 2.88 should result in a split. It does not.

Why?

Jonathan




__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] finegray function in the survival package

2017-05-03 Thread Therneau, Terry M., Ph.D.
Your question is much like "what is in my pocket?", since the only possible answer is
"how could I know?".  You have some block of R code that produces the stated error, but
we don't know what it is.  One wild guess is that a death time which is <= 0 will produce
this error, since that would lead to an invalid time interval.


Please look at the "Asking for help" section of https://www.r-project.org/help.html, and 
in particular the link on reproducable examples.  You need to give an example, with data, 
that reproduces the problem.


Terry T.

(PS, I agree that the error message is too terse.  I will try to make it 
better.)


On 05/03/2017 05:00 AM, r-help-requ...@r-project.org wrote:

Hello R-team,

I am trying to use the tmerge function from the survival library. My data is
similar to the mgus2 dataset from R.

But, I get the following message.

Error in tmerge(dat, dat, id = Number, death = event(dYears, death), BMF =
event(ptemp,  :
  tstart must be > tstop
Could you help me?

Thanks,
Ahalya.



On Wed, Sep 7, 2016 at 8:29 AM, Therneau, Terry M., Ph.D. <thern...@mayo.edu

wrote:

On 09/07/2016 05:00 AM, r-help-requ...@r-project.org wrote:


Dear R-Team,

I have been trying to use the finegray routine that creates a special data
so that Fine and Gray model can be fit. However, it does not seem to work.
Could you please help me with this issue?


Thanks,
Ahalya.



You have given us no details of your example code that "doesn't work", and
I can't read your mind.  So no, we can't help.  Give us a hint.

Terry T



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] survival package can't find Ccoxfit6

2017-04-27 Thread Therneau, Terry M., Ph.D.



On 04/27/2017 09:53 AM, sesh...@mskcc.org wrote:

Thank you Drs. Therneau and Murdoch.

"Why not use coxph.fit?" -- My use case scenario is that I needed the Cox model 
coefficients for resampled data. I was trying to reduce the computational overhead of 
coxph.fit (since it will repeated a large number of times) by stripping all the parts 
that I don't need such as sorting of the data prior to coxfit6.c call and Martingale 
residual and concordance computations after the parameters are estimated.


That is an interesting use case which I had not thought about.  The first question is
just how much slower coxph.fit is than the stripped down version (I'd guess about 1/2,
but that is just a guess), and whether that makes a real difference to your code.  If it
is spending 10% of its time in the coxph calculation a change to 5% isn't that much, but
90% is something else.  The next is what is the main impediment (I'd guess concordance,
but again just a guess).   Perhaps I could add concordance= and/or resid= flags to the
fitting routine.
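
A rough timing sketch of that comparison (assuming the long-standing coxph.fit argument
list of x, y, strata, offset, init, control, weights, method, rownames; check
args(coxph.fit) in your version first):

library(survival)
x <- cbind(age = lung$age)
y <- Surv(lung$time, lung$status)
system.time(for (i in 1:500) coxph(y ~ x))          # full wrapper
system.time(for (i in 1:500)                        # stripped-down fitter
    coxph.fit(x, y, strata = NULL, offset = NULL, init = NULL,
              control = coxph.control(), weights = NULL,
              method = "efron", rownames = NULL))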




Under the R v3.4.0 model one cannot create any modified form of coxph.fit and expect it
to work.  Worse yet is the following, where I copy "coxph.fit" to my workspace as
"mycoxph.fit" (it works initially, because the environment is namespace:survival, and
fails when the environment is changed to R_GlobalEnv).



If you were under Linux another solution would be to grab the source from github, add
your routine to the R/ directory, then R CMD build followed by R CMD INSTALL.  Macintosh
is essentially as easy, though you need to install Xcode for the compilers.  The compile
toolchain for Windows is outside my ken.


Let's keep talking.

Terry T.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] survival package can't find Ccoxfit6

2017-04-27 Thread Therneau, Terry M., Ph.D.

 Let me summarize rather than repeat the entire thread:

An error report from a user (seshan) stumped me, and I asked for help here.

Duncan Murdoch picked up on fine details of the error message, i.e., that the error did 
NOT come from within the survival package.  That changes the whole tenor of the discussion.


Indeed, the user has their own function "phcoefs" that directly calls one of my internal C 
routines.  As of R 3.4, this can only be done for routines that I explicitly export.   I 
don't export coxfit6.c.


Where to go from here?

1. I'm not against exporting a routine, but I'm not going to do it without a discussion.
Doing so is more work for me: I'd need to write a test routine in order to ensure
long-term reliability of the export, and it ties my hands wrt future changes.  In this
particular case, why not use coxph.fit?


2. One of the design goals for the survival package is to make it usable as a component
for others' work.  For instance all of the return structures are easily inspected (no S4
classes) and most are carefully documented, e.g. help(coxph.object).  The core
computations of coxph are split out into separate functions coxph.fit and agreg.fit, so
that they can be called directly without the formula and argument checking overhead.
Ditto for survreg, survfit, survdiff and concordance.  Given the number of other packages
that depend on survival I have been at least moderately successful at this aim.  (To be
honest this is not entirely altruism on my part, as it stops the near infinite requests
to add 'just one more thing' to the package.)  This also means I am open to modifying a
routine or exporting a call -- if you can make a good argument.


3. I need a good way to document this for the survival package. Yet one more chapter in
my half-written book. Someday...


4. Calling another package's C routines is dangerous, and one of the goals of the 3.4 
namespace changes was to stop this from happening willy-nilly.  The new error messages 
look like success.   Though it means that I'm getting multiple "not found" emails.


Terry T.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] survival package can't find Ccoxfit6

2017-04-26 Thread Therneau, Terry M., Ph.D.
A user contacted me directly about this, I answered with my best understanding of the 
recent R-help discussion of the issue, and their response to my response shows that I'm 
not quite right.


I am emphatically not an MS Windows user so am asking for help -- which I will cut/paste 
to this user and to the next dozen who will invariably contact me directly.


Thanks,
  Terry Therneau



 Forwarded Message 
Subject: RE: survival package
Date: Wed, 26 Apr 2017 18:05:30 +
From: sesh...@mskcc.org
To: Therneau, Terry M., Ph.D. <thern...@mayo.edu>

Thank you for the quick response. The sessionInfo command for v3.4.0 does in fact report
survival_2.41-3. Furthermore, while both v3.3.1 and v3.4.0 are on the same computer, the
library paths do not have any directory in common:



.libPaths()

[1] "C:/Program Files/R/R-3.4.0/library"




and

.libPaths()

[1] "C:/Program Files/R/R-3.3.1/library"





Thanks,
Venkat


-----Original Message-----
From: Therneau, Terry M., Ph.D. [mailto:thern...@mayo.edu]
Sent: Wednesday, April 26, 2017 1:42 PM

To: Seshan, Venkatraman E./Epidemiology-Biostatistics
Subject: Re: survival package

This has been discussed in R-help by multiple people.  You have a pre-3.4 version of the 
survival package somewhere on your search path, and the method for resolving .C calls has 
changed.   The sessionInfo command should report survival version 2.41-3.


Terry T.


On 04/26/2017 12:17 PM, sesh...@mskcc.org wrote:

Dear Prof. Therneau,

I am encountering an error message when I try to use the coxfit6 routine from 
the survival package under the 3.4.0 version of R. The minimal function and the 
script are in the attached file. This function worked under earlier versions of 
R.

---------------------------------------------------------------------------

***
**  Works under R-3.3.1  **
***


source("coxfit6-issue.R")

[1] -0.4838181


sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] survival_2.39-4

loaded via a namespace (and not attached):
[1] Matrix_1.2-6splines_3.3.1   grid_3.3.1  lattice_0.20-33

---------------------------------------------------------------------------

***
**  Does not work under R-3.4.0  **
***


library(survival)
source("coxfit6-issue.R")

Error in .Call("Ccoxfit6", as.integer(control$iter.max), stime, as.integer(sstat),  :
   "Ccoxfit6" not available for .Call() for package "survival"

sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] survival_2.41-3

loaded via a namespace (and not attached):
[1] compiler_3.4.0  Matrix_1.2-9splines_3.4.0   grid_3.4.0
[5] lattice_0.20-35

---------------------------------------------------------------------------

When I remove the quotes surrounding Ccoxfit6 in the function, both versions give the error:

Error in phcoefs(stim[ii], sts[ii], as.matrix(as.double(cvt[ii])), oo$coefficients,  :
   object 'Ccoxfit6' not found


I would greatly appreciate your help in resolving this.

Thanks,
Venkat Seshan





__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] finding out if a method exists

2017-03-28 Thread Therneau, Terry M., Ph.D.
I'm thinking of adding a new "cmatrix" function/method to the survival package but before 
I do I'd like to find out if any other packages already use this function name.   The 
obvious method is to look at the NAMESPACE file for each package in CRAN and read the 
export list.


This is the kind of task for which someone, somewhere will have written routines.  I just 
don't know who or where.


Any hints?

Terry T.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R: How to prune using holdout data

2017-02-28 Thread Therneau, Terry M., Ph.D.

Let me give an outline of how to answer Alfredo's question via an example.
I will split the data set "lung" into two pieces.  For these subjects with
advanced lung cancer the physician's assessment of ECOG performance status
(ph.ecog) is one of the most powerful indicators of outcome.  Try to
predict it from other variables.

library(survival)   # for the test data set
library(rpart)

data1 <- lung[1:125,]
data2 <- lung[126:228,]

rfit1 <- rpart(ph.ecog ~ ., data=data1)
printcp(rfit1)

        CP nsplit rel error  xerror     xstd
1 0.565788      0   1.00000 1.04037 0.100516
2 0.098516      1   0.43421 0.44906 0.045001
3 0.042708      2   0.33570 0.35134 0.041692
4 0.031032      3   0.29299 0.37610 0.042971
5 0.019949      4   0.26196 0.37753 0.044692
6 0.010000      5   0.24201 0.39166 0.050332

# Validate using data2.  First get predictions for each of the pruned trees
cpvalues <- rfit1$cptable[,1]
pmat <- matrix(0, nrow(data2), length(cpvalues))
for (i in 1:length(cpvalues))
    pmat[,i] <- predict(prune(rfit1, cpvalues[i]), newdata=data2)

Now, we need to decide on a measure of error.  Try simple squared error.

error <- colMeans((data2$ph.ecog - pmat)^2)
round(error, 3)
[1] 0.493 0.280 0.210 0.225 0.186 0.198

This is simple, but other cases are more complex.  The performance score is
actually an integer from 0-4 (5 = dead); see
  http://ecog-acrin.org/resources/ecog-performance-status

table(lung$ph.ecog)
  0   1   2   3
 63 113  50   1

Suppose instead we fit a model and treat the response as categorical?
The total number of nested models is a bit smaller.

rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class")

printcp(rfit2)
       CP nsplit rel error  xerror     xstd
1 0.35938      0   1.00000 1.00000 0.086951
2 0.12500      1   0.64062 0.64062 0.081854
3 0.06250      2   0.51562 0.70312 0.083662
4 0.03125      4   0.39062 0.57812 0.079610
5 0.01000      5   0.35938 0.56250 0.078977

 predict(rfit2, newdata=data2)[1:5,]
          0      1       2 3
126 0.03125 0.9375 0.03125 0
127 0.03125 0.9375 0.03125 0
128 0.03125 0.9375 0.03125 0
129 0.03125 0.9375 0.03125 0
130 0.37500 0.6250 0.0 0

Now, we can ask for predicted probabilities for each class (the default), which is a vector
of length 4 for each subject, or for the predicted class, which is a single value.  Which
do we want, and then what is the best measure of prediction error?
If three subjects with value 0 had prediction class vectors of (.8, .2, 0, 0),
(.8, .1, .1, 0) and (.45, .25, .2, .1), one outlook would say they all are the
same (all pick 0 as the best), others would give them different errors.  Is
the second prediction worse than the first?
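As a sketch of two common choices (using the fits above; the ok flag guards
against the one subject in lung with a missing ph.ecog):

phat   <- predict(rfit2, newdata=data2)                 # probability matrix
pclass <- predict(rfit2, newdata=data2, type="class")
ok     <- !is.na(data2$ph.ecog)
# 1. misclassification rate of the predicted class
mean(as.character(pclass)[ok] != as.character(data2$ph.ecog)[ok])
# 2. Brier score: squared distance from the observed 0/1 class indicators
ymat <- outer(as.character(data2$ph.ecog)[ok], colnames(phat), "==") + 0
mean(rowSums((ymat - phat[ok, ])^2))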

What if the single subject with ph.ecog=3 had ended up in the validation data
set; how should we judge their prediction?

This complexity is one reason that there is not a simple function for
"validation" with a new data set.



On 02/27/2017 09:48 AM, Alfredo wrote:

Thank you, Terry, for your answer.

I’ll try to explain better my question. When you create a classification or 
regression
tree you first grow a tree based on a splitting criteria: this usually results 
in a large
tree that provides a good fit to the training data. The problem with this tree 
is its
potential for overfitting the data: the tree can be tailored too specifically 
to the
training data and not generalize well to new data. The solution (apart from
cross-validation)
is to find a smaller subtree that results in a low error rate on *holdout or 
validation data.*

Hope it helps to clarify my question.

Best,

Alfredo

-Messaggio originale-
Da: Therneau, Terry M., Ph.D. [mailto:thern...@mayo.edu]

You will need to give more detail of exactly what you mean by "prune using a 
validation
set". THe prune.rpart function will prune at any value you want, what I suspect 
you are
looking for is to compute the error of each possible tree, using a validation 
data set,
then find the best one, and then prune there.

How do you define "best"?



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to prune using a validation set

2017-02-27 Thread Therneau, Terry M., Ph.D.
You will need to give more detail of exactly what you mean by "prune using a validation 
set".  THe prune.rpart function will prune at any value you want, what I suspect you are 
looking for is to compute the error of each possible tree, using a validation data set, 
then find the best one, and then prune there.


How do you define "best"?

Terry Therneau


On 02/26/2017 05:00 AM, r-help-requ...@r-project.org wrote:

I'd like to use a different data ( validation) set for pruning my
classification tree. Unfortunately there aren't arguments to get this in
prune.rpart().

Any suggestions?

Thanks!

Alfredo



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Can mice() handle crr()? Fine-Gray model

2017-01-23 Thread Therneau, Terry M., Ph.D.
Look at the finegray command within the survival package; the competing risks vignette has 
coverage of it.  The command creates an expanded data set with case weights, such that 
coxph() on the new data set = the Fine Gray model for the original data.  Anything that 
works with coxph is valid on the new data.  Caveat -- I don't know mice() well at all: the 
dat1 data set below has multiple observations per subject; should the mice() command be 
cognizant of this?


Terry Therneau


library(survival)
library(mice)

test1 <- data.frame (time=c(4,3,1,1,2,2,3,5,2,4,5,1,
4,3,1,1,2,2,3,5,2,4,5,1),
 status=c(1,1,1,0,2,2,0,0,1,1,2,0,
  1,1,1,0,2,2,0,0,1,1,2,0),
 x=c(0,2,1,1,NA,NA,0,1,1,2,0,1,
 0,2,1,1,NA,NA,0,1,1,2,0,1),
 sex=c(0,0,0,NA,1,1,1,1,NA,1,0,0,
   0,0,0,NA,1,1,1,1,NA,1,0,0))

# Endpoint 1, simple, then with mice
fit0 <- coxph(Surv(time, status==1) ~ sex + x, test1)
dat0 <- mice(test1, m=10)
mfit0 <- with(dat0, coxph(Surv(time, status==1) ~ sex + x))
summary(pool(mfit0))

# Endpoint 1, Fine-Gray model
fg1 <- finegray(Surv(time, factor(status)) ~ ., data=test1, etype=1)
fit1 <- coxph(Surv(fgstart, fgstop, fgstatus) ~ sex + x, data=fg1,
  weight= fgwt)

dat1 <- mice(fg1, m=10)
mfit1 <- with(dat1, coxph(Surv(fgstart, fgstop, fgstatus) ~ sex + x,
 weight=fgwt))
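# Pooling should then follow the same pattern as before.  This last line is
# a guess on my part; I have not checked that pool() accepts these fits.
summary(pool(mfit1))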


On 01/23/2017 05:00 AM, r-help-requ...@r-project.org wrote:

Here is an example:


# example

library(survival)
library(mice)
library(cmprsk)

test1 <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1,
4,3,1,1,2,2,3,5,2,4,5,1),
 status=c(1,1,1,0,2,2,0,0,1,1,2,0,
1,1,1,0,2,2,0,0,1,1,2,0),
 x=c(0,2,1,1,NA,NA,0,1,1,2,0,1,
0,2,1,1,NA,NA,0,1,1,2,0,1),
 sex=c(0,0,0,NA,1,1,1,1,NA,1,0,0,
0,0,0,NA,1,1,1,1,NA,1,0,0)))

dat <- mice(test1,m=10)

#Cox regression: cause 1

models.cox1 <- with(dat,coxph(Surv(time, status==1) ~ x +sex
))

summary(pool(models.cox1))

#Cox regression: cause 1 or 2

models.cox <- with(dat,coxph(Surv(time, status==1 | status==2) ~ x +sex
))
models.cox
summary(pool(models.cox))


# crr()

#Fine-Gray model

models.FG <- with(dat, crr(ftime=time, fstatus=status,
                           cov1=test1[, c("x","sex")],
                           failcode=1, cencode=0, variance=TRUE))

summary(pool(models.FG))

#Error in pool(models.FG) : Object has no vcov() method.

models.FG




-- Andreu Ferrero Gregori


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Hyperbola formula

2017-01-19 Thread Therneau, Terry M., Ph.D.
This simple form of a hyperbola is not well known.  I find it useful for change point 
models: since the derivative is continuous it often behaves better in a maximizer.


h1 <- function(x, b, k=3)  .5 * b * (x + sqrt(x^2 + k^2))

Function h1() has asymptotes of y=0 to the left of 0 and y=b*x to the right of 0.  The 
parameter k controls how tightly the curve bends at zero.


A generalization is
h2 <- function (x, b, k=3) {
z <- x - b[4]
b[1] + b[2]*z + .5*b[3]* (z + sqrt(z^2 + k^2))
}
b[1] and b[2] are the intercept and slope of the left portion of the curve, b[4] is the 
inflection point, and b[3] is the change in slope at the inflection point.


The main downside is that k is arbitrary. I simply choose it to "look right", though it 
too could be part of an optimization.
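A quick look at the shape, as a sketch:

x <- seq(-10, 10, length=201)
plot(x, h1(x, b=1, k=3), type='l', ylab="h1(x)")
lines(x, h1(x, b=1, k=1), lty=2)          # a tighter bend at zero
abline(h=0, col="gray"); abline(0, 1, col="gray")  # the two asymptotes (b=1)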


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Parallel package guidance needed

2017-01-16 Thread Therneau, Terry M., Ph.D.
I have a process that I need to parallelize, and have a question about two
different ways to proceed.  It is essentially an MCMC exploration where
the likelihood is a sum over subjects (6000 of them), and the per-subject
computation is the slow part.

Here is a rough schematic of the code using one approach:

mymc <- function(formula, data, subset, na.action,  id, etc) {
 # lots of setup, long but computationally quick

 hlog <- function(thisid, param) {
# compute the loglik for this subject
   ...
   }

uid <- unique(id)  # multiple data rows for each subject

for (i in 1:burnin) {
   param <- get_next_proposal()
   loglist <- mclapply(uid, hlog, param=param)
   loglik <- sum(unlist(loglist))
   # process result
   }

   # Now the non-burnin MCMC iterations
  �
}

The second approach is to put cluster formation outside the loop, e.g.,

 ...
 clust <- makeForkCluster()
 for (i in 1:burnin) {
 param <- get_next_proposal()
 loglist <- parLapply(clust, uid, hlog, param=param)
 loglik <- sum(unlist(loglist))
 # process result
 }

   # rest of the code

   stopCluster(clust)

--

On the face of it, the second looks like it "could" be more efficient since it
only starts and stops the subprocesses once.  A short trial on one of our
cluster servers seems to say the opposite.  The load average on a quiet machine
never gets much over 5-6  using method 2, and in the 20s for method 1
(detectCores() =80 on the box, we used mc.cores=50).  Wall time for method 2
is looking to be several hours.

Any pointers to documentation/discussion at this level would be much 
appreciated.  I'm going to be fitting a lot of models.

Terry T.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] which parallel routine

2017-01-07 Thread Therneau, Terry M., Ph.D.
I'm looking for advice on which of the parallel systems to use.

Context: maximize a likelihood, each evaluation is a sum over a large number of 
subjects (>5000) and each of those per subject terms is slow and complex.

If I were using optim the context would be
  fit <- optim(initial.values, myfun, ...)
  myfun <- function(params) {
      # ... do some initial setup ...
   temp <- apply-in-parallel(id,  per-subject-eval-fun, p=params)
   unlist(temp)
}

  The use of mclapply seems like there would be a lot of overhead starting and 
stopping threads.   But none of the tutorials I've found addresses this 
particular question.  Both direct answers and pointers to other docs would be 
welcome.

Terry T.



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] independent censoring

2016-11-29 Thread Therneau, Terry M., Ph.D.



On 11/29/2016 05:00 AM, r-help-requ...@r-project.org wrote:

Independent censoring is one of the fundamental assumptions in survival 
analysis. However, I cannot find any test for it or any paper which discusses 
how real that assumption is.


I would be grateful if anybody could point me to some useful references. I have 
found the following paper as an interesting reference but it is not freely 
available.


Leung, Kwan-Moon, Robert M. Elashoff, and Abdelmonem A. Afifi. "Censoring issues in 
survival analysis." Annual review of public health 18.1 (1997): 83-104.




This is because there is no test for independent censoring.  Say I am following a cohort 
of older gentlemen (65 years old) who were diagnosed with condition "x", and after 8 years 
there are a dozen who no longer answer my letters.  Why not?

 a. Perhaps because they are in a nursing home, with dementia.
 b. Perhaps because they have moved to another city to interact with and be near to 
grandchildren.


In case a, those lost to follow-up are much sicker than the average subject, and in case b 
they are most likely the most healthy and active of the group.   In a) the KM will 
over-estimate survival and in b it will underestimate.


The main point is that there is absolutely no way to know, other than actually tracking 
the subjects down.  Any study which has a substantial fraction with incomplete follow-up 
is making a guess.  The more accurate phrase would be "a blind hope for independent 
censoring" rather than "assume".  There are cases where simple reasoning or experience tells me 
that this hope is futile, but mostly we just hope.  The alternative is proactive 
follow-up, i.e., devote enough staff and resources to actively contact all of the study 
subjects on a regular schedule.   Even then you will lose a few.  (In one study several 
years ago, long term follow-up of cancer, there was a new Mrs Smith who refused to 
acknowledge the existence of the prior wife, even to forward letters.)


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Updated package: survival_2.40-1

2016-11-01 Thread Therneau, Terry M., Ph.D.
Survival version 2.40 has been released to CRAN.  Be warned, however, that some users may 
see changes in results.


The heart of the issue can be shown with a simple example. Calculate the following simple 
set of intervals:

<<>>=
birth <- as.Date("1973/03/10")
start <- as.Date("1998/09/13") + 1:40
end   <- as.Date("1998/12/03") + rep(1:10, 4)
interval <- (end-start)
table(interval)
51 61 71 81
10 10 10 10
@

Each interval has a different start and end date, but there are only 4 unique intervals, 
each of which appears 10 times.

Now convert this to an age scale.

<<>>=
start.age <- as.numeric(start-birth)/365.25
end.age   <- as.numeric(end  -birth)/365.25
age.interval <- end.age - start.age
table(match(age.interval, unique(age.interval)))
1 2 3 4 5 6 7 8
9 1 5 5 1 9 7 3
@
There are now eight different age intervals instead of 4, and the 8 unique values appear 
between 1 and 9 times each.  Exact results likely will depend on your computer system. We 
have become a victim of round off error.


Some users prefer to use time in days and some prefer time in years, and those latter 
users expect, I am sure, survival analysis results to be identical on the two scales.  
Both the coxph and survfit routines treat tied event times in a special way, and this 
roundoff can make actual ties appear as non-tied values, however. Parametric survival such 
as survreg is not affected by this issue.


In survival version 2.40 this issue has been addressed for the coxph and survfit routines; 
input times are subjected to the same logic found in the all.equal routine in order to 
determine actual ties. The upshot is that some users may see changed results.


For the following test case cox1 and cox2 are identical coefficients in version 2.40, but 
different in prior versions.

<<>>=
ndata <- data.frame(id=1:30,
  birth.dt = rep(as.Date("1953/03/10"), 30),
  enroll.dt= as.Date("1993/03/10") + 1:30,
  end.dt   = as.Date("1996/10/21") + 1:30 +
  rep(1:10, 3),
  status= rep(0:1, length=30),
  x = 1:30)
ndata$enroll.age <- with(ndata, as.numeric(enroll.dt - birth.dt))/365.25
ndata$end.age<- with(ndata, as.numeric(end.dt - birth.dt))/365.25

fudays <- with(ndata, as.numeric(end.dt - enroll.dt))
fuyrs  <- with(ndata, as.numeric(end.age- enroll.age))
cox1 <- coxph(Surv(fudays, status) ~ x, data=ndata)
cox2 <- coxph(Surv(fuyrs,  status) ~ x, data=ndata)
@

This general issue of floating point precision arises often enough in R that it is part of 
the frequently asked questions, see FAQ 7.31 on CRAN. The author of the survival routines 
(me) has always used days as the scale
for analysis -- just by habit, not for any particularly good reason -- so the issue had 
never appeared in my work nor in the survival package's test suite. Due to user input, 
this issue had been addressed earlier in the survfit routine, but only when the status 
variable was 0/1, not when it is a factor.


As a final footnote, the simple data set above also gives different results when coded in 
SAS: I am not alone in overlooking it.  As a consequence, the maintainer expects to get 
new emails that "we have found a bug in your code: it gives a different answer than 
SAS".  (This is an actual quote.)


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] isssues with predict.coxph, offset, type = "expected", and newdata

2016-10-01 Thread Therneau, Terry M., Ph.D.
I'm off on vacation and checking email only intermittently.   
Wrt the offset issue, I expect that you are correct.  This is not a case that I 
had ever envisioned, and so was not on my "list" when writing the code and 
certainly has no test case.  That does not mean that it shouldn't work, just 
that I am not shocked to see it.   I will look into this.

For the second case of a NULL model I am less sympathetic.  This is, in theory, 
just reading off values from a Nelson hazard estimate at specific time points; 
using a coxph call to do so is a case of swatting a fly with a hammer.   A bit 
more background might make me more excited about extending the code to this 
case.

Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] finegray function in the survival package

2016-09-07 Thread Therneau, Terry M., Ph.D.



On 09/07/2016 05:00 AM, r-help-requ...@r-project.org wrote:

Dear R-Team,

I have been trying to use the finegray routine that creates a special data
so that Fine and Gray model can be fit. However, it does not seem to work.
Could you please help me with this issue?


Thanks,
Ahalya.



You have given us no details of your example code that "doesn't work", and I can't read 
your mind.  So no, we can't help.  Give us a hint.


Terry T

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Conditional gap time frailty cox model for recurrent events

2016-09-06 Thread Therneau, Terry M., Ph.D.
You can ignore the message below.  The maximizing routine buried within the frailty() 
command buried with coxph() has a maximizer that is not the brightest.  It sometimes gets 
lost but then finds its way again.  The message is from one of those.  It likely took a 
not-so-good update step, and took a couple of iterations to recover.


In coxpenal.fit(X, Y, strats, offset, init = init, control, weights = weights,  
:
   Inner loop failed to coverge for iterations 3 4

To be fair the maximizing problem for a mixed effects Cox model is not easy.  In the coxme 
code I have spent much more time on the details of this.


Terry Therneau

--

On 09/06/2016 05:00 AM, r-help-requ...@r-project.org wrote:

Dear Elisabetta,

I have no direct answer to your question, but a suggestion: Use the
'coxme' function (in the package with the same name). In the help page
for 'frailty' (survival) you will find: "The coxme package has
superseded this method. It is faster, more stable, and more flexible."

Hth, Göran

On 2016-09-05 11:42, Elisabetta Petracci wrote:

>Dear users,
>
>I am fitting a conditional gap time frailty cox model weighting
>observations by means of inverse probability time dependent weights.
>Attached find the self-explaining dataset.
>
>I have used the following sintax:
>
>coxph(Surv(gaptstart,gaptstop,status)~treat+strata(nrecord01)+frailty(id,distribution="gamma",method="em"),
>data=dataNOTDrr,weights=dataNOTDrr$weight)
>
>
>And I get the following warning:
>
>Warning message:
>In coxpenal.fit(X, Y, strats, offset, init = init, control, weights =
>weights,  :
>   Inner loop failed to coverge for iterations 3 4
>
>
>I have tried to:
>- leave out the weights but I get the error anyway
>- to randomly select a subset of patients and I don't get the error. This
>seems to suggest that the problem is with some observations.
>
>Any suggestion?
>
>Many thanks,
>
>Elisabetta
>


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] problem concernig Survsplit, package survival

2016-08-22 Thread Therneau, Terry M., Ph.D.


On 08/20/2016 05:00 AM, Vinzenz wrote:

For some days I have been struggling with a problem concerning the
'survSplit' function of the package 'survival'. Searching the internet I
have found a pretty good (German) description by Daniel Wollschläger
of how to use survSplit:


The survSplit routine was recently updated to allow a formula as the first argument.  This 
change makes the routine much easier to use and more flexible.  Old forms of the call 
should have worked as well, but unfortunately I introduced a bug in the code.  For the 
time being, change your call to
   dfSurvCP <- survSplit(Surv(obsT, status) ~ ., dfSurv, cut=seq(30, 90, by=30), id="ID", 
zero=0)


I will fix this.  I apologize for the error.

Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Johannes Hengelbrock <j.hengelbr...@uke.de>

2016-06-04 Thread Therneau, Terry M., Ph.D.
I'm traveling so chasing this down more fully will wait until I get home.   
Four points.

1. This is an edge case.  You will notice that if you add "subset=1:100" to the 
coxph call that the function works perfectly.  You have to get up to 1000 or so 
before it fails.

2. The  exact partial likelihood of Cox is referred to as "exact" by coxph and 
as "discrete" by SAS phreg.  What phreg calls 'exact' is the exact marginal 
likelihood of Prentice.   I don't know which of these you were using in SAS so 
can't verify that "phreg works".

3. The computations and memory for the exact calculation go up phenomenally 
with the number of ties.   It is a sum over  n choose d terms,  if there were 
10 events out of 200 at risk this is a sum over all ways to choose 10 subjects 
out of 200 which is approx 2e16 terms.  Your example requires all choices of 
1541 out of 3000, which I would expect to take somewhere near 
age-of-the-universe seconds to compute.   The code uses a clever nested 
computation due to Gail et al which will cut that time down to infinity/10.

4. This example drove coxph into a memory fault, I suspect.  I will certainly 
look into patching that once I get home.  (There is a check for this but it  
must have a flaw).  My sympathy for your plight is low, however.  I can't 
conceive of the real data problem where someone would actually need to compute 
this awful likelihood.  1541 events tied at the same time?   Or even more to 
imagine a case where I would need it badly enough to wait a lifetime for the 
answer.  The Efron approximation is pretty darn good for cases like this, and 
it is fast.

Terry Therneau

---

Dear users,

I'm trying to estimate a conditional logistic model using the
coxph()-function from the survival package. Somehow, the model does not
converge if time is set to the same value for all observations:

 library(survival)
 set.seed(12345)
 n <- 3000
 a <- rbinom(n, 1, 0.5)
 b <- rbinom(n, 1, 0.5)
 coxph(formula = Surv(rep(1, 3000), a) ~ b, method = "exact")

Error in fitter(X, Y, strats, offset, init, control, weights = weights,
: NA/NaN/Inf in foreign function call (arg 5) In addition: Warning
message: In fitter(X, Y, strats, offset, init, control, weights =
weights, :Ran out of iterations and did not converge

Changing iter.max does not help, apparently. Strangely, the exact same
model converges in SAS.

I know that I could estimate the model differently (via glm), but I
would like to understand why the model does converge in SAS but not in R.

Thanks,
Johannes

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] Survival 2.39

2016-04-21 Thread Therneau, Terry M., Ph.D.
A new version of the survival package has been released.  The biggest change is stronger 
support for multi-state models, which is an outgrowth of their increasing use in my own 
practice. Interested users are directed to the "time dependent covariates" vignette for 
discussion of the tmerge and survSplit functions, which are useful tools to build the 
requisite data sets, and to the "multi-state" vignette for model fits and plots.


Terry Therneau

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] simple interactions

2016-04-15 Thread Therneau, Terry M., Ph.D.

I was right that there is an easy answer!

Thanks for the 3 quick answers, all three correct and useful.
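For the record, a self-contained sketch of the suggested coding, with
simulated data of my own:

set.seed(1)
dat <- data.frame(age   = runif(90, 40, 80),
                  group = factor(rep(c("A", "B", "C"), each=30)))
dat$y <- 10 + c(.5, 1, 1.5)[dat$group] * dat$age + rnorm(90)  # true slopes
coef(lm(y ~ group + age:group, data=dat))
# the three age-within-group coefficients are the per-group age slopes
# directly, rather than an overall slope plus contrasts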

Terry Therneau


On 04/15/2016 07:15 AM, Thierry Onkelinx wrote:

Dear Terry,

Does fitting group + age:group instead of age*group solve your problem?

Best regards,

ir. Thierry Onkelinx


2016-04-15 13:58 GMT+02:00 Therneau, Terry M., Ph.D. <thern...@mayo.edu>:

I'd like to get interaction terms in a model to be in another form. Namely, 
suppose I
had variables age and group, the latter a factor with levels A, B, C, with  
age *
group in the model.  What I would like are the variables "age:group=A", 
"age:group=B"
and "age:group=C"  (and group itself of course).  The coefficients of the 
model will
then be the age effect in group A, the age effect in group B and the age 
effect in C
rather than the standard ones of an overall age effect followed by 
contrasts.  This
is often a better format for tables in a publication.

Yes, I can reconstruct these from the original fit, but I have a lot of 
variables for
several models and it would be easier to have an automatic form.  I suspect 
that there
is an easy answer, but I don't see it.

Terry Therneau


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] simple interactions

2016-04-15 Thread Therneau, Terry M., Ph.D.
I'd like to get interaction terms in a model to be in another form.  Namely, suppose I had 
variables age and group, the latter a factor with levels A, B, C, with  age * group in the 
model.  What I would like are the variables "age:group=A", "age:group=B" and 
"age:group=C"  (and group itself of course).  The coefficients of the model will then be 
the age effect in group A, the age effect in group B and the age effect in C rather than 
the standard ones of an overall age effect followed by contrasts.  This is often a better 
format for tables in a publication.


Yes, I can reconstruct these from the original fit, but I have a lot of variables for 
several models and it would be easier to have an automatic form.  I suspect that there is 
an easy answer, but I don't see it.


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using R for cURL commands

2016-04-04 Thread Therneau, Terry M., Ph.D.



On 04/02/2016 05:00 AM, r-help-requ...@r-project.org wrote:

Hello,


I'm looking for a way in which R can make my life easier.

Currently I'm using R to convert data from a dataframe to JSON and then sending 
these JSONs to a REST API using a curl command in the terminal (I'm on a Mac).


I've been looking for a way to use R for sending data from R to the rest api. 
My primary focus was on using R for executing the curl command; however, I'm 
open to other approaches. The method I've been using so far:



-
These are lines snipped from a working application.  I make use of the httr package.  The 
parameters of the call are assembled as a nested list "xlist" prior to this section.  The 
set of errors I check for is unique to the API I'm talking to, an internal one that's a 
bit odd in its return codes, but still it gives you the idea.


auth <- authenticate(id$lanid, id$password, type="basic")
query <- POST(url=control$posturl, body= xlist, auth, encode="json")

if (status_code(query) >= 300) handle_reset(control$posturl)
if (status_code(query) == 401) {
    stop("invalid lanid/password for this application")
} else if (status_code(query) == 400) {
    stop("internal error from dart package: 'syntax of request'")
} else stop_for_status(query)   # other query errors, e.g. server down

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Convergence issues when using ns splines (pkg: spline) in Cox model (coxph) even when changing coxph.control

2016-03-31 Thread Therneau, Terry M., Ph.D.
Thanks to David for pointing this out.  The "time dependent covariates" vignette in the 
survival package has a section on time dependent coefficients that talks directly about 
this issue.  In short, the following model is simply wrong:

 coxph(Surv(time, status) ~ trt + prior + karno + I(karno * log(time)), 
data=veteran)

People try this often as a way to create the time dependent covariate  Karnofsky * log(t), 
which is often put forwards as a way to deal with non-proportional hazards.  To do this 
correctly you have to use the tt() functionality in coxph to move the computation out of 
the model statement:

  coxph(Surv(time, status) ~ trt + prior + karno + tt(karno), data=veteran,
tt = function(x, time, ...) x*log(time))


BTW the following SAS code is also wrong:
 proc phreg data=veteran;
 model time * status(0) = trt + prior + karno* time;

SAS does the right thing, however, if you move the computation off the model 
line.
  model time * status(0) = trt + karno + zzz;
  zzz = karno * time;

The quote "SAS does it but R fails" comes at me moderately often in this context.  The 
reason is that SAS won't LET you put a phrase like "log(time)" into the model statement, 
so people end up doing the right thing, but by accident.


Terry T.


On 03/30/2016 05:28 PM, Göran Broström wrote:



On 2016-03-30 23:06, David Winsemius wrote:



On Mar 29, 2016, at 1:47 PM, Jennifer Wu wrote:

Hi,

I am currently using R v3.2.3 and on Windows 10 OS 64Bit.

I am having convergence issues when I use coxph with a interaction
term (glarg*bca_py) and interaction term with the restricted cubic
spline (glarg*bca_time_ns). I use survival and spline package to
create the Cox model and cubic splines respectively. Without the
interaction term and/or spline, I have no convergence problem. I
read some forums about changing the iterations and I have but it
did not work. I was just wondering if I am using the iter.max and
outer.max appropriately. I read the survival manual, other R-help
and stackoverflow pages and it suggested changing the iterations
but it doesn't specify the maximum I can go to. I ran something
similar in SAS and did not run into a convergence problem.

This is my code:

bca_time_ns <- ns(ins_ca$bca_py, knots=3,
Boundary.knots=range(2,5,10)) test <- ins_ca$glarg*ins_ca$bca_py
test1 <- ins_ca$glarg*bca_time_ns


In your `coxph` call the variable 'bca_py' is the survival time and


Right David: I didn't notice that the 'missing main effect' in fact was part of 
the
survival object! And as you say: Time to rethink the whole model.

Göran


yet here you are constructing not just one but two interactions (one
of which is a vector but the other one a matrix) between 'glarg' and
your survival times. Is this some sort of effort to identify a
violation of proportionality over the course of a study?

Broström sagely points out that these interactions are not in the
data-object and subsequent efforts to refer to them may be confounded
by the multiple environments from which data would be coming into the
model. Better to have everything come in from the data-object.

The fact that SAS did not have a problem with this rather
self-referential or circular model may be a poor reflection on SAS
rather than on the survival package. Unlike Therneau or Broström who
asked for data, I suggest the problem lies with the model
construction and you should be reading what Therneau has written
about identification of non-proportionality and identification of
time dependence of effects. See Chapter 6 of his "Modeling Survival
Data".





__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] convergence issues with coxph

2016-03-30 Thread Therneau, Terry M., Ph.D.
Failure to converge in a coxph model is very rare.  If the program does not make it in 20 
iterations it likely will never converge, so your control argument will do little.


Without the data set I have no way to guess what is happening.  My first question, 
however, is to ask how many events you have, e.g. table(bca).  I count 19 covariates on 
the right hand side, and a good rule of thumb is that one should have at least 5- 10 
endpoints per covariate for simple numerical stability and 10-20 for statistical 
stability.  That means 100-200 events.  Most medical data sets have fewer than this. A 
data set with 5000 rows and 4 death counts as "4" in this calculation by the way.


  I am always interested in data sets that push the boundaries of the code and can look 
deeper if you want to send me a copy.  Details of how things are coded can matter, e.g., 
centered covariates.  Otherwise there is little we can do.


Terry Therneau


On 03/30/2016 05:00 AM, r-help-requ...@r-project.org wrote:

I am having convergence issues when I use coxph with a interaction term 
(glarg*bca_py) and interaction term with the restricted cubic spline 
(glarg*bca_time_ns). I use survival and spline package to create the Cox model 
and cubic splines respectively. Without the interaction term and/or spline, I 
have no convergence problem. I read some forums about changing the iterations 
and I have but it did not work. I was just wondering if I am using the 
inter.max and outer.max appropriately. I read the survival manual, other R-help 
and stackoverflow pages and it suggested changing the iterations but it doesn't 
specify what is the max I can go. I ran something similar in SAS and did not 
run into a convergence problem.

This is my code:

bca_time_ns <- ns(ins_ca$bca_py, knots=3, Boundary.knots=range(2,5,10))
test <- ins_ca$glarg*ins_ca$bca_py
test1 <- ins_ca$glarg*bca_time_ns

coxit <- coxph.control(iter.max=1, outer.max=1)

bca<-coxph(Surv(bca_py,bca) ~ glarg + test + test1 + age + calyr + diab_dur + 
hba1c + adm_met + adm_sulfo + adm_tzd + adm_oth +
 med_statin + med_aspirin + med_nsaids + bmi_cat + ccscat + alc + smk, 
data=ins_ca, control=coxit, ties=c("breslow"))


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plotting spline term when frailty term is included

2016-03-02 Thread Therneau, Terry M., Ph.D.



On 03/02/2016 05:00 AM, r-help-requ...@r-project.org wrote:

I'd very much appreciate your help in resolving a problem that I'm having with 
plotting a spline term.

I have a Cox PH model including a smoothing spline and a frailty term as 
follows:

fit<-coxph(Surv(start,end,exit) ~ x + pspline(z) + frailty(a))

When I run a model without a frailty term, I use the following in order to plot 
the splines:

termplot(fit, term = 2, se = TRUE, ylab = "Log Hazard", rug=TRUE, xlab = 
"z_name")

However, when the frailty term is included, it gives this error:

Error in pred[, first] : subscript out of bounds

What am I doing wrong here? Or is it the case that termplot does not work when 
splines and frailty are included?


There are 3 parts to the answer.
 1. The first is a warning: wrt mixed effects Cox models, I shifted my energy to coxme 
over 10 years ago.  The "penalized add-on to coxph" approach of the frailty function was 
an ok first pass, but is just too limited for any but the simplest models.  I'm unlikely to 
fix issues, since there are others much higher on my priority list.


 2. As Dave W. pointed out, the key issue is that predict(type='terms') does not work 
for a model with two penalized terms, when one is frailty and the other pspline. 
Termplot depends on predict.


 3. Again, as Dave W pointed out, the whole issue of what the "correct" answer would be 
gets much more complicated when one adds random effects to the mix; some things have not been 
done because it is not clear where to go.  (Survival curves for a mixed effects model only 
recently got added to my "todo" list, even though it has been on the wish list forever, 
because I finally have a notion of what a good approach would be.)


In your case I'd advise an end run: fit the model using ns() instead of pspline.  I like 
smoothing splines better than regression splines, but the fact is that for most data sets 
they result in nearly identical answers.
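For instance, an untested sketch of that end run (df=4 is an arbitrary
choice on my part):

library(splines)
fit2 <- coxph(Surv(start, end, exit) ~ x + ns(z, df=4) + frailty(a))
termplot(fit2, term=2, se=TRUE, ylab="Log Hazard")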


Terry T

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Estimating Mean of a Poisson Distribution Based on interval censoring

2016-02-15 Thread Therneau, Terry M., Ph.D.
For an interval censored poisson or lognormal, use survreg() in the survival package.  (Or 
if you are a SAS fan use proc lifereg).  If you have a data set where R and SAS give 
different answers I'd like to know about it, but my general experience is that this is 
more often a user error.  I am also curious to learn exactly what you mean by "interval 
censored poisson".  Exponential with interval time to first event is equivalent to 
poisson, which is what I'd guess from "lognormal", but you may mean something else.
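For reference, a minimal sketch of an interval-censored fit with survreg,
using simulated data where each event time is known only to whole units:

library(survival)
set.seed(2)
tt    <- rexp(50, rate=0.2)          # true event times
left  <- floor(tt)
right <- floor(tt) + 1
left[left == 0] <- NA                # type="interval2": NA = left censored
fit <- survreg(Surv(left, right, type="interval2") ~ 1, dist="exponential")
exp(-coef(fit))                      # estimated event rate, near 0.2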



Terry Therneau
(author of survival)



On 02/14/2016 05:00 AM, r-help-requ...@r-project.org wrote:

Dear all,
I appreciate that if you let me know if there is any package implemented in R 
for Estimating Mean of a Poisson Distribution Based on Interval censoring? And 
if yes, could you please provide some information about it:)?
By the way, is there anything for lognormal?I think fitdistcens is not good for 
this purpose as it gives me different result compared to SAS and only useful 
for right/left censoring and not interval censoring?(or both left and right 
together).?
Kind regards,Mohsen


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Survival::coxph (clogit), survConcordance vs. summary(fit) concordance

2016-01-21 Thread Therneau, Terry M., Ph.D.
I read the digest form which puts me behind, plus the last 2 days have been solid meetings 
with an external advisory group so I missed the initial query.   Three responses.


1. The clogit routine sets the data up properly and then calls a stratified Cox model.  If 
you want the survConcordance routine to give the same answer, it also needs to know about 
the strata

survConcordance(Surv(rep(1, 76L), resp) ~ predict(fit) + strata(ID),
                data=dat)
I'm not surprised that you get a very different answer with/without strata.

2. I've never thought of using a robust variance for the matched case/control model.  I'm 
having a hard time wrapping my head around what you would expect that to accomplish 
(statistically).  Subjects are already matched on someone from the same site, so where 
does a per-site effect creep in?  Assuming there is a good reason and I just don't see it 
(not an unwarranted assumption), I'm not aware of any work on what an appropriate variance 
would be for the concordance in that case.


3. I need to think about the large variance issue.

Terry Therneau


On 01/20/2016 08:09 PM, r-help-requ...@r-project.org wrote:

Hi,

I'm running conditional logistic regression with survival::clogit. I have
"1-1 case-control" data, i.e., there is 1 case and 1 control in each strata.

Model:
fit <- clogit(resp ~ x1 + x2, strata(ID), cluster(site), method ="efron",
data = dat)
Where resp is 1's and 0's, and x1 and x2 are both continuous.

Predictors are both significant. A snippet of summary(fit):
Concordance= 0.763  (se = 0.5 )
Rsquare= 0.304   (max possible= 0.5 )
Likelihood ratio test= 27.54  on 2 df,   p=1.047e-06
Wald test= 17.19  on 2 df,   p=0.0001853
Score (logrank) test = 17.43  on 2 df,   p=0.0001644,   Robust = 6.66
  p=0.03574

The concordance estimate seems good but the SE is HUGE.

I get a very different estimate from the survConcordance function, which I
know says computes concordance for a "single continuous covariate", but it
runs on my model with 2 continuous covariates

survConcordance(Surv(rep(1, 76L), resp) ~ predict(fit), dat)
n= 76
Concordance= 0.9106648 se= 0.09365047
concordant  discordant   tied.risk   tied.timestd(c-d)
  1315.   129. 0.   703.   270.4626

Are both of these concordance estimates valid but providing different
information?
Is one more appropriate for measuring "performance" (in the AUC sense) of
conditional logistic models?
Is it possible that the HUGE SE estimate represents a convergence problem
(no warnings were thrown when fitting the model), or is this model just useless?

Thanks!


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] clogit and weights

2015-12-22 Thread Therneau, Terry M., Ph.D.
How should the weights be treated?  If they are multiple observation weights (a weight of 
"3" is shorthand for 3 subjects) that leads to a different likelihood than sampling 
weights ("3" means to give this one subject more influence).  The clogit command can't 
read your mind and so has chosen not to make a guess.


Also, please do not post in html.  As you see below it leads to a mangled 
message.

Terry Therneau


On 12/22/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Merry Christmas everyone:
I have the following data (mydat) and would like to fit a conditional logistic
regression model considering "weights".

id  case  exposure  weights
1   1     1         2
1   0     0         2
2   1     1         2
2   0     0         2
3   1     1         1
3   0     0         1
4   1     0         2
4   0     1         2

The R function "clogit" is for such purposes but it ignores weights.  I tried
the function "mclogit" instead, which seems to consider the weights option:

options(scipen=999)
library(mclogit)
# create the above data frame
id       <- c(1,1,2,2,3,3,4,4)
case     <- c(1,0,1,0,1,0,1,0)
exposure <- c(1,0,1,0,1,0,0,1)
weights  <- c(2,2,2,2,1,1,2,2)
(mydata  <- data.frame(id, case, exposure, weights))
fit <- mclogit(cbind(case,id) ~ exposure, weights=weights, data=mydata)
summary(fit)

The answer, however, doesn't seem to be correct.  Could anyone please provide
me with some solution to this?  Thanks in advance,
Keramat Nourijelyani, PhD

[[alternative HTML version deleted]]



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Survival analysis: ERROR: Time and status are different lengths

2015-12-04 Thread Therneau, Terry M., Ph.D.
I expect that reading the result of print(fit.weib) will answer your question.  If there 
were any missing values in the data set, then the fit.weib$linear.predictors will be 
shorter than the original data set,

and the printout will have a note about "...deleted due to missing".

The simplest solution to this is to set
  options(na.action="na.exclude")
before doing the fit.  Then predict(fit) and resid(fit) will return vectors of the same 
length as the input data, containing NA in the appropriate positions.  The default 
na.action of "na.omit" leaves missing out of both the fit and the residuals.


(Unfortunately, only a few modeling functions in R pay attention to the difference between 
these two na.action options.)
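A sketch of the repaired script, using the variable names from the original
post (untested, since I don't have the data):

options(na.action = "na.exclude")
fit.weib <- survreg(Surv(TimeDeath, event) ~ age + sex + mutation +
                    ventilation + BM1 + BM2, data = DF, dist = "weibull")
lp     <- predict(fit.weib, type = "lp")           # now NA-padded to nrow(DF)
resids <- (log(DF$TimeDeath) - lp)/fit.weib$scale  # survreg models log(time)
resKM  <- survfit(Surv(resids, DF$event) ~ 1)      # lengths now agree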


Terry Therneau


On 12/04/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Hi,

I am fitting an AFT model assuming a Weibull distribution and I would like
to check the residuals compared to the Kaplan Meier residuals, but when I
try to create the Kaplan Meier residuals I get an error: Time and status
are different lengths.

I am using the following script:

# Fitting the AFT model
fit.weib <- survreg(Surv(TimeDeath, event) ~ age + sex + mutation +
ventilation + BM1 + BM2, data = DF, dist = "weibull")
fits <- fit.weib$linear.predictors
resids <- (fit.weib$y[, 1] - fits)/fit.weib$scale
resKM <- survfit(Surv(resids, event) ~ 1, data = DF)

I get the error from the last line of the script.

I tried some things that didn't work and I was wondering if anyone could
help me.
If you need more information please let me know.

Thanks in advance,


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SWEAVE - a gentle introduction

2015-11-18 Thread Therneau, Terry M., Ph.D.

As a digest reader I am late to the discussion, but let me toss in 2 further 
notes.

1. Three advantages of knitr over Sweave
  a. The book "Dynamic documents with R and knitr".  It is well written; sitting down for 
an evening with the first half (70 pages) is a pretty good way to learn the package.  The 
second half covers features you may or may not use over time.  My only complaint about the 
book is that it needs a longer index; I have had several cases of "I know I read about 
xxx, but where was it?".  But this is ameliorated by


  b. Good online resources: manuals, tutorials, and question/answer pairs on 
various lists.

  c. Ongoing support. Sweave is static.

2. Latex vs markdown  (knitr supports both)
  One can choose "latex style" or "markdown style" for writing your documents.  I know 
latex very well (I wrote my book using it) but recommend markdown to all others in my 
department.  The latter is about 1/3 the learning curve.  Markdown produces very nice 
output, latex goes the extra mile to produce true book quality.  But one rarely needs that 
extra polish, and even more rarely needs it enough to justify the extra learning cost.  I 
still use the latex form myself as it is not at all difficult to use --- once you learn it.
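As a concrete illustration of the markdown flavor, here is a minimal
document; save it as report.Rmd, and rmarkdown::render("report.Rmd") will
produce an HTML file with the code, output, and plot inline:

    ---
    title: "Example report"
    output: html_document
    ---

    ```{r}
    summary(cars)
    plot(cars)
    ```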


Terry Therneau

On 11/18/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I am looking for a gentle introduction to SWEAVE, and would appreciate 
recommendations.
I have an R program that I want to run and have the output and plots in one 
document. I believe this can be accomplished with SWEAVE. Unfortunately I don't 
know HTML, but am willing to learn. . . as I said I need a gentle introduction 
to SWEAVE.
Thank you,
John


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error survreg: Density function returned an an invalid matrix

2015-11-16 Thread Therneau, Terry M., Ph.D.
The error message states that there is an invalid value for the density.  A long stretch 
of code is not very helpful in understanding this.  What we need is the definition of 
your density -- as it would be written in a textbook.  This formula needs to give a valid 
response for the range -infinity to +infinity.  Or more precisely, for any value that the 
maximizer might guess at some point during the iteration.


Terry T.


On 11/14/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Thanks Terry but the error persists. See:


> library(foreign)
> library(survival)
> library(VGAM)
> mypareto <- list(name='Pareto',
+      init=

[remainder of message truncated]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] recursive partitioning in R

2015-11-12 Thread Therneau, Terry M., Ph.D.
Look at the rpart vignette "User written split functions".  It allows you to add 
your own splitting method (in R, no C required).  This has proven to be very 
useful for trying out new ideas.


The second piece would be to do your own cross-validation.  That is, turn off the built in 
cross-validation using the xval=0 option, then explicitly do the cross-validation 
yourself.  Fit a new tree to some chosen subset of data, using your split rule of course, 
and then use predict() to get predicted values for the remaining observations. Again, this 
is all in R, and you can explicitly control your in or out of bag subsets.

The xpred.rpart function may be useful to automate some of the steps.
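A sketch of the hand-rolled loop, using the car90 data that ships with rpart
(one 80/20 split shown; a real cross-validation would rotate the subsets):

library(rpart)
set.seed(3)
inbag <- sample(nrow(car90), floor(0.8 * nrow(car90)))
fit   <- rpart(Price ~ ., data=car90[inbag, ],
               control=rpart.control(xval=0))      # built-in CV turned off
pred  <- predict(fit, newdata=car90[-inbag, ])
mean((car90$Price[-inbag] - pred)^2, na.rm=TRUE)   # out-of-sample error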

If you look up rpart on CRAN, you will see a link to the package source.  If you were to 
read the C source code you will discover that 95% is boring bookkeeping of what 
observations are in what part(s) of the tree, sorting the data, tracking missing values, 
etc.  If you ever do want to write your own code you are more than welcome to build off 
this --- I wouldn't want to write that part again.


Terry Therneau

On 11/12/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Dear List,

I'd like to make a few modifications to the typical CART algorithm, and
I'd rather not code the whole thing from scratch.  Specifically I want
to use different in-sample and out-of-sample fit criteria in the split
choosing and cross-validation stages.

I see however that the code for CART in both the rpart and the tree
packages is written in C.

Two questions:

   * Where is the C code?  It might be possible to get a C-fluent
 programmer to help me with this.
   * Is there any code for CART that is written entirely in R?

Thanks,
Andrew


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error survreg: Density function returned an an invalid matrix

2015-11-04 Thread Therneau, Terry M., Ph.D.

Hi, I want to perform a survival analysis using survreg procedure from
survival library in R for a pareto distribution for a time variable, so I
set the new distribution using the following sintax:

 library(foreign)
 library(survival)
 library(VGAM)

 mypareto <- list(name='Pareto',
  init= function(x, weights,parms){

etc.

The survreg routine fits location-scale distributions such that (t(y) - Xb)/s ~ F, where t 
is an optional transformation, F is some fixed distribution and X is a matrix of 
covariates.  For any distribution the questions to ask before trying to add the 
distribution to survreg are

  - can it be written in a location-scale form?
  - if so, how do the parameters of the distribution map to the location (Xb) 
and scale (s).

In fitting data we normally have per-subject location (X b) but an intercept-only model is 
of course possible.


If y is Weibull then log(y) fits into the framework, which is how survreg fits it.  The 
transformation of the location and scale parameters for log(y) back to the usual 
Weibull parameterization for y often trips people up (see comments in the Examples section 
of ?survreg).


The log of a Pareto can be written in this form (I think?).  The two parameters are the 
shape a and lower limit b, with survival function S(x) = (b/x)^a for x >= b.  If y = 
log(x) the survival function for y is
S(y) = (b/exp(y))^a = exp[-(y - log(b))/(1/a)], which has location log(b) and scale 1/a. 
But even if I am correct the discontinuity at b will cause the underlying Newton-Raphson 
method to fail.
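A quick numeric check of that algebra (a=2 and b=1.5 chosen arbitrarily):

a <- 2; b <- 1.5
x <- seq(2, 20, length=5)
all.equal((b/x)^a, exp(-(log(x) - log(b))/(1/a)))   # TRUE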


 Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] coxme adequacy check

2015-10-28 Thread Therneau, Terry M., Ph.D.



On 10/28/2015 06:00 AM, r-help-requ...@r-project.org wrote:

Hello all!

I'm fitting a mixed effects cox model with the coxme function of the coxme package.
I want to know the best way to check model adequacy, since the
function cox.zph does not work for coxme objects.

Thanks in advance,

Raoni


No one has done the theory work to see if the method used by coxme extends to random 
effects models.  I suspect that it does, but that is a long way from a proof.  So at this 
point I don't have a good suggestion, even though I have been thinking about the problem.


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Fitting a curve to weibull distribution in R using nls

2015-10-14 Thread Therneau, Terry M., Ph.D.



On 10/14/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I am trying to fit this data to a weibull distribution:

My y variable is:1  1  1  4  7 20  7 14 19 15 18  3  4  1  3  1  1  1
1  1  1  1  1  1

and x variable is:1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24



One could always use existing R functions that fit a Weibull regression, instead of 
reinventing the wheel.


library(survival)
y <- scan()
1  1  1  4  7 20  7 14 19 15 18  3
4  1  3  1  1  1
1  1  1  1  1  1

wfit <- survreg(Surv(1:24) ~ 1, weights=y, dist="weibull")
wfit

Call:
survreg(formula = Surv(1:24) ~ 1, weights = y, dist = "weibull")

Coefficients:
(Intercept)
   2.352121

Scale= 0.4130924

Loglik(model)= -351.4   Loglik(intercept only)= -351.4
n= 24


zz <- seq(0, 25, length=100)
plot(zz, dsurvreg(zz, 2.352121, 0.4130924), col=2, type='l', ylim=c(0, .15),
xlab="Value", ylab="Density")
points(1:24, y/sum(y))

-
There are a half dozen ways to parameterize a Weibull distribution; the location-scale 
form used by survreg is one of the less common.  See help(survreg) for more information -- 
look at the example near the bottom of the page.


Terry Therneau

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart cutpoint interpretation

2015-10-08 Thread Therneau, Terry M., Ph.D.
The cutpoint is on the predictor, so the interpretation is the same as it is for any other 
rpart model.  The subjects with predictor < cutpoint form one group and those > cutpoint 
the other.  The cutpoint is chosen to give the greatest difference in "average y" between 
the groups.  For poisson "average y" is an event rate.


On 10/08/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I am trying to derive cutpoint/threshold with a poisson distributed
dependent variable. I know how to interpret cutpoint with binary dependent
variable based on direction. Can some on help me to intrepret cutpoint for
poisson case with one independent variable with the derived threshold.


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] flatten a list

2015-09-29 Thread Therneau, Terry M., Ph.D.
I'd like to flatten a list from 2 levels to 1 level.  This has to be easy, but 
is currently opaque to me.

temp <- list(1:3, list(letters[1:3], duh= 5:8),  zed=15:17)

Desired result would be a 4 element list.
[[1]] 1:3
[[2]] "a", "b", "c"
[[duh]] 5:8
[[zed]] 15:17

(Preservation of the names is not important)

Terry T
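
One possible base-R answer (a sketch, not from the replies that followed): wrap 
each non-list element so that c() concatenates at the right level.

temp <- list(1:3, list(letters[1:3], duh= 5:8),  zed=15:17)
flat <- do.call(c, lapply(temp, function(x) if (is.list(x)) x else list(x)))
length(flat)   # 4, with the names "duh" and "zed" preserved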


Re: [R] XPT files

2015-09-28 Thread Therneau, Terry M., Ph.D.
This was an FDA/SAS bargain a long while ago.  SAS made the XPT format publicly available 
and unchanging in return for it becoming a standard for submission.  Many packages can 
reliably read or write these files.  (The same is not true for other SAS file formats, nor 
is xport the SAS default.)  I do not know how good the R routines are, having never used them.
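
For completeness, a hedged sketch of the read side (the file name is hypothetical), 
using the read.xport function from the foreign package that ships with R; the haven 
package's read_xpt is another option:

library(foreign)
dat <- read.xport("submission.xpt")   # returns a data frame
str(dat)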


The following snippet is taken from
  
http://www.fda.gov/downloads/ForIndustry/DataStandards/StudyDataStandards/UCM312964.pdf

2 Dataset Specifications
2.1 File Format

SAS XPORT transport file format, also called Version 5 SAS transport format, is an open 
format published by the SAS Institute. The description of this SAS transport file format 
is in the public domain. Data can be translated to and from this SAS transport format to 
other commonly used formats without the use of programs from SAS Institute or any specific 
vendor.


Sponsors can find the record layout for SAS XPORT transport files through SAS technical 
support technical document TS-140. This document and additional information about the SAS 
Transport file layout can be found on the SAS World Wide Web page at 
http://www.sas.com/fda-esub.


---
Said document TS-140  talks about IBM 360 and Dec VAX machines but no others, which should 
give you an idea of its age.


Terry Therneau


On 09/27/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Peter

Thanks for the explanation.  One further comment - you wrote:

>I don't think the FDA "requests" XPT files

In fact, they do make such a request.  Here is the actual language received 
this week (and repeatedly in the past):

>Program/script files should be submitted using text files (*.TXT) and the data 
should be submitted using SAS transport files (*.XPT).

Dennis




Re: [R] retaining characters in a csv file

2015-09-23 Thread Therneau, Terry M., Ph.D.

Thanks for all for the comments, I hadn't intended to start a war.

My summary:
  1. Most important: I wasn't missing something obvious.  This is always my first 
suspicion when I submit something to R-help, and it's true more often than not.


  2. Obviously (at least it is now), the CSV standard does not specify that quotes should 
force a character result.  R is not "wrong".  Wrt using what Excel does as a litmus test, 
I consider that to be totally uninformative about standards: neither pro (like Duncan) nor 
anti (like Rolf), but simply irrelevant.  (Like many MS choices.)


  3. I'll have to code my own solution: either pre-scan the first few lines to create 
a colClasses vector (a minimal sketch follows below), or use read_csv from the readr library 
(if there are leading zeros it keeps the column as character, which may suffice for my 
needs), or something else.


  4. The source of the data is a "text/csv" field coming from an http POST request.  This 
is an internal service on an internal Mayo server and coded by our own IT department; this 
will not be the first case where I have found that their definition of "csv" is not quite 
standard.
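
The pre-scan idea of point 3, as a minimal sketch (the file name and the 
leading-zero rule are illustrative only):

peek <- read.csv("input.csv", nrows=5, colClasses="character")
keep <- sapply(peek, function(x) any(grepl("^0[0-9]", x)))  # leading zeros?
cls  <- ifelse(keep, "character", NA)   # NA = use the default conversion
dat  <- read.csv("input.csv", colClasses=cls)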


Terry T.




On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:

I have a csv file from an automatic process (so this will happen
thousands of times), for which the first row is a vector of variable
names and the second row often starts something like this:

5724550,"000202075214",2005.02.17,2005.02.17,"F", .

Notice the second variable which is
   a character string (note the quotation marks)
   a sequence of numeric digits
   leading zeros are significant

The read.csv function insists on turning this into a numeric.  Is there
any simple set of options that
will turn this behavior off?  I'm looking for a way to tell it to "obey
the bloody quotes" -- I still want the first, third, etc columns to
become numeric.  There can be more than one variable like this, and not
always in the second position.

This happens deep inside the httr library; there is an easy way for me
to add more options to the read.csv call but it is not so easy to
replace it with something else.




Re: [R] Accounting for correlated random effects in coxme

2015-09-22 Thread Therneau, Terry M., Ph.D.

I've been away for a couple weeks and am now catching up on email.

The issue is that the coxme code does not have conversions built in for all of the 
possible types of sparse matrix.  Since it assumes that the variance matrix must be 
symmetric, the not-necessarily-symmetric dgCMatrix class is not one that I had considered. 
You should transform it to dsCMatrix first, which is a symmetric class.  Or, if it is 
small enough, to a simple matrix.
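
A minimal sketch of that conversion with the Matrix package (toy matrix, not the 
poster's phylogeny data):

library(Matrix)
m  <- sparseMatrix(i=c(1,2,2,3), j=c(1,1,2,3), x=c(2,1,2,2),
                   dims=c(3,3))          # class dgCMatrix, general storage
ms <- forceSymmetric(m, uplo="L")        # class dsCMatrix, symmetric storage
md <- as.matrix(ms)                      # or, for a small problem, plain dense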

Terry T.


On 09/22/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I have a problem with running the mixed effects Cox regression model using
a distance matrix from a phylogeny rather than a pedigree. I searched
previous posts and didn't find any directly relevant previous posts.

I am interested in using a mixed effects Cox regression model to determine
the best predictors of time to recruitment in 80 different reintroduced
plant populations representing a total of 31 species. I will like to
account for correlated random effects that result from phylogenetic
relationships amongst species. Dr. Therneau's 2015 article on Mixed Effects
Cox Models provide a very helpful template for me to do this with the coxme
function in R. In this article, the correlation structure due to genetic
relationships amongst individuals was defined using a kinship matrix
derived from a pedigree. Instead of a pedigree, I have a phylogeny for
these 31 species. Hence, I used the inverseA function in the MCMCglmm
package to generate an inverse additive genetic relatedness matrix from the
phylogeny for these 31 species. And then fed it in as input to the varlist
argument in my mixed effects cox regression model (using function coxme). I
got an error message (please see below). Based on the error, one thought I
had was to convert the inverseA matrix from a 'dgCMatrix' to a 'bdsmatrix',
but this was not successful either. I have also unsuccessfully tried to use
a pairwise phylogenetic distance matrix.

Is there a better way to do this? I basically just want to account for the
correlated random effects due to phylogenetic relatedness amongst the 31
species represented in the dataset for the Cox regression model.  Please
see my code below and I welcome suggestions on how best to make this work.




[R] retaining characters in a csv file

2015-09-22 Thread Therneau, Terry M., Ph.D.
I have a csv file from an automatic process (so this will happen thousands of times), for 
which the first row is a vector of variable names and the second row often starts 
something like this:


5724550,"000202075214",2005.02.17,2005.02.17,"F", .

Notice the second variable which is
  a character string (note the quotation marks)
  a sequence of numeric digits
  leading zeros are significant

The read.csv function insists on turning this into a numeric.  Is there any simple set of 
options that
will turn this behavior off?  I'm looking for a way to tell it to "obey the bloody quotes" 
-- I still want the first, third, etc columns to become numeric.  There can be more than 
one variable like this, and not always in the second position.


This happens deep inside the httr library; there is an easy way for me to add more options 
to the read.csv call but it is not so easy to replace it with something else.


Terry T



Re: [R] using survreg() in survival package with "long" data

2015-08-31 Thread Therneau, Terry M., Ph.D.


On 08/30/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I'm unable to fit a parametric survival regression using survreg() in the survival package with 
data in "counting-process" ("long") form.

To illustrate using a scaled-down problem with 10 subjects (with data placed on 
the web):



As usual I'm a day late since I read digests, and Goran has already clarified things.  A 
discussion of this is badly needed in my as yet unwritten book on using the survival 
package.  From a higher level view:
  If an observation is interval censored (a,b) then one knows that the event happened 
between time "a" and time "b", but not when.  The survreg routine can handle interval 
censored data since it is parametric (you need to integrate over the interval).  The 
interval (-infinity, b) is called 'left censored' and the interval (a, infinity) is 'right 
censored'.  Left censored data is rare in medical work, an example might be a chronic 
disease like rheumatoid arthritis where we know that the true disease onset was some time 
before the date it was first detected, and one is trying to deduce the duration of disease.


  Left truncation at time 'a' means that any events before time "a" are not in the data 
set.  In a referral center like mine this includes any subjects who die before they come 
to us.  The coxph model handles left truncation naturally via its counting process 
formulation.  That same formulation also allows it to deal with time dependent 
covariates.   Accelerated failure time models like survreg can handle left truncation in 
principle, but they require that the values of any covariates are known from time 0 -- 
even for a truncated subject.   I have never added left-truncation to the survreg code, 
mostly because I have never needed it myself, but also because users would immediately 
think that they could accomplish time-dependent covariates by simply using a long format 
data set. Rather, each subject needs to be linked to a full covariate history, which is a 
bit more work.


 So:  coxph does left truncation but not left (or interval) censoring
  survreg does interval censoring but not left truncation (or time 
dependent covariates).
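
A small sketch of the interval-censored side in survreg (made-up data; NA marks an 
open-ended interval):

library(survival)
lower <- c(1, 3, 5, NA, 2)    # NA lower bound = left censored
upper <- c(2, 3, 8,  4, NA)   # NA upper bound = right censored; (3,3) = exact
ifit <- survreg(Surv(lower, upper, type="interval2") ~ 1, dist="weibull")
summary(ifit)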

Terry T



Re: [R] Survival analysis and predict time-to-death

2015-08-18 Thread Therneau, Terry M., Ph.D.
I read this list a day late as a digest so my answers are rarely the first.  (Which is 
nice as David W answers most of the survival questions for me!)


What you are asking is reasonable, and in fact is common practice in the realm of 
industrial reliability, e.g., Meeker and Escobar, Statistical Methods for Reliability 
Data.  Extrapolation of the survival curve to obtain the mean and percentiles of the 
lifetime distribution for some device (e.g. a washing machine) is their bread and butter, 
used for instance to determine the right size for an inventory of spare parts.  For most 
of us on this list who do medical statistics and live in the Kaplan-Meier/ Cox model world 
the ideas are uncommon.  I was lucky enough to sit through one of Bill Meeker's short 
courses and retain some (minimal) memory of it.


  1. You are correct that parametric models are essential.  If the extrapolation is 
substantial (30% or more censored, say), then the choice of distribution can be critical. 
 If failure is due to repeated insult, e.g., the multi-hit model, then Weibull tends to 
be preferred; if it is from degradation, e.g., flexing of a diaphragm, then the log-normal. 
 Beyond this you need more guidance than mine.


  2. The survreg routine assumes that log(y) ~ covariates + error.  For a log-normal 
distribion the error is Gaussian and thus the predict(fit, type='response') will be 
exp(predicted mean of log time), which is not the predicted mean time.  For Weibull the 
error dist is asymmetric so things are more muddy.  Each is the MLE prediction for the 
subject, just not interpretable as a mean.  To get the actual mean you need to look up the 
formulas for Weibull and/or lognormal in a textbook, and map from the survreg 
parameterization to whatever one the textbook uses.  The two parameterizations are never 
the same.


  3. Another option is predicted quantiles.  ?predict.survreg shows how to get the entire 
survival curve.  The mean can be obtained as the area under the survival curve.  Relevant 
to your question, the expected time remaining for a subject still alive at time =10, say, 
is  integral(S(t), from 10 to infin) / S(10), where S is the survival curve.  You can also 
read off quantiles of the expected remaining life.
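
A sketch of point 3 using the lung data (the covariate value and the truncation 
point are arbitrary choices of mine):

library(survival)
fit <- survreg(Surv(time, status) ~ age, data=lung, dist="lognormal")
p   <- seq(.001, .999, length=999)
tq  <- predict(fit, newdata=data.frame(age=62), type="quantile", p=p)
# S(t) = 1-p at the quantile points; the mean is the area under S(t),
# here truncated at the 99.9th percentile
sum(diff(tq) * (1 - p[-1]))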


Terry Therneau
(author of the survival package)

On 08/18/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Dear All,

I would like to build a model, based on survival analysis on some data, that
is able to predict the *expected time until death* for a new data
instance.

Data
For each individual in the population I have, for each unit of time, the
status information and several continuous covariates for that particular
time. The data is right censored since at the end of the time interval
analyzed, instances could still be alive and die later.

Model
I created the model using R and the survreg function:

lfit <- survreg(Surv(time, status) ~ X)

where:
- time is the time vector
- status is the status vector (0 alive, 1 death)
- X is a bind of multiple vectors of covariates

Predict time to death
Given a new individual with some covariates values, I would like to predict
the estimated time to death. In other words, the number of time units for
which the individual will be still alive till his death.

I think I can use this:

ptime <- predict(lfit, newdata=data.frame(X=NEWDATA), type='response')

Is that correct? Am I going to get the expected-time-to-death that I would
like to have?

In theory, I could provide also the time information (the time when the
individual has those covariates values), should I simply add that in the
newdata:

ptime <- predict(lfit, newdata=data.frame(time=TIME, X=NEWDATA),
type='response')

Is that correct? Is this going to improve the prediction? (for my data, the
time already passed should be an important variable).

Any other suggestions or comments?

Thank you!




Re: [R] Differences in output of lme() when introducing interactions

2015-07-23 Thread Therneau, Terry M., Ph.D.
The following are in parody (but like all good parody correct wrt the salient features). 
The musings of

Guernsey McPhearson
   http://www.senns.demon.co.uk/wprose.html#Mixed
   http://www.senns.demon.co.uk/wprose.html#FDA


In formal publication:
 Senn, Statistical Issues in Drug Development, second edition, Chapter 14: 
Multicentre Trials
 Senn, The many modes of meta, Drug information journal, 34:535-549, 2000.

The second points out that in a meta analysis no one would ever consider giving both large 
and small trials equal weights, and relates that to several other bits of standard 
practice.  The 'equal weights' notion embedded in a fixed effects model + SAS type 3 is an 
isolated backwater.


Terry T.

PS. The Devil's Drug Development Dictionary at the same source has some gems.  Three 
rather random choices:


Bayesian - One who, vaguely expecting a horse and catching a glimpse of a donkey, strongly 
concludes he has seen a mule.


Medical Statistician - One who won't accept that Columbus discovered America because he 
said he was looking for India in the trial Plan.


Trend Towards Significance - An ever present help in times of trouble.



On 07/22/2015 06:02 PM, Rolf Turner wrote:

On 23/07/15 01:15, Therneau, Terry M., Ph.D. wrote:

SNIP


3. Should you ever use it [i.e. Type III SS]?  No.  There is a very strong inverse
correlation between understanding what it really is and recommending its
use.  Stephen Senn has written very intelligently on the issues.


Terry --- can you please supply an explicit citation?  Ta.

cheers,

Rolf





Re: [R] Differences in output of lme() when introducing interactions

2015-07-22 Thread Therneau, Terry M., Ph.D.
Type III is a peculiarity of SAS, which has taken root in the world.  There are 3 main 
questions wrt to it:


1. How to compute it (outside of SAS).  There is a trick using contr.treatment coding that 
works if the design has no missing factor combinations, your post has a link to such a 
description.  The SAS documentation is very obtuse, thus almost no one knows how to 
compute the general case.


2. What is it?  It is a population average: the predicted average treatment effect in a 
balanced population -- one where all the factor combinations appear the same number of 
times.  One way to compute 'type 3' is to create such a data set, get all the predicted 
values, and then take the average prediction for treatment A, the average for treatment B, 
the average for C, ..., and test whether these averages are the same (a small sketch of 
this recipe follows after point 3).  The algorithm of #1 above leads to another 
explanation which is a false trail, in my opinion.


3. Should you ever use it?  No.  There is a very strong inverse correlation between 
understanding what it really is and recommending its use.  Stephen Senn has written very 
intelligently on the issues.
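
A toy illustration of the recipe in point 2 (all names and numbers below are made up):

d <- expand.grid(trt=factor(c("A","B")), sex=factor(c("F","M")))
d <- d[rep(1:4, c(40, 10, 25, 25)), ]     # a decidedly unbalanced population
set.seed(1)
d$y <- 2 + (d$trt=="B") + .5*(d$sex=="M") + rnorm(nrow(d))
fit  <- lm(y ~ trt * sex, data=d)
grid <- expand.grid(trt=levels(d$trt), sex=levels(d$sex))   # balanced grid
tapply(predict(fit, newdata=grid), grid$trt, mean)   # 'type 3' style averages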


Terry Therneau


On 07/22/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Dear Michael,
thanks a lot. I am studying the marginality and I came across to this post:

http://www.ats.ucla.edu/stat/r/faq/type3.htm

Do you think that the procedure there described is the right one to solve my 
problem?

Would you have any other online resources to suggest especially dealing with R?

My department does not have a statician, so I have to find a solution with my 
own capacities.

Thanks in advance

Angelo




[R] matrix manipulation

2015-07-16 Thread Therneau, Terry M., Ph.D.
This is as much a mathematics as an R question, in the "this should be easy but I don't 
see it" category.


Assume I have a full rank p by p matrix V  (aside: V = (X'X)^{-1} for a particular setup), 
a p by k matrix B, and I want to complete an orthogonal basis for the space with distance 
function V.  That is, find A such that t(B) %*% V %*% A =0, where A has p rows and p-k 
columns.


With V=identity this is easy. I can do it in 1-2 lines using qr(), lm(), or several other 
tools.  A part of me is quite certain that the general problem isn't more than 3 lines of 
R, but after a day of beating my head on the issue I still don't see it.  Math wise it 
looks like a simple homework problem in a mid level class, but I'm not currently sure that 
I'd pass said class.


If someone could show the way I would be grateful.  Either that or assurance that the 
problem actually IS hard and I'm not as dense as I think.


Terry T.



Re: [R] matrix manipulation -solved

2015-07-16 Thread Therneau, Terry M., Ph.D.

Yes it is obvious --- once someone else pointed it out.
Thanks for the hint.

Terry T.


On 07/16/2015 12:52 PM, Peter Langfelder wrote:

Hi Terry,

maybe I'm missing something, but why not define a matrix BB = V'B;
then t(B) %*% V = t(BB), then your problem reduces to finding A such
that t(BB) %*% A = 0?

Peter
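
In code, the hint works out like this (a sketch with made-up dimensions, relying 
on V being symmetric so that t(B) %*% V = t(V %*% B)):

set.seed(1)
p <- 5; k <- 2
X <- matrix(rnorm(20*p), 20, p)
V <- solve(crossprod(X))     # (X'X)^{-1}, symmetric
B <- matrix(rnorm(p*k), p, k)
BB <- V %*% B
A <- qr.Q(qr(BB), complete=TRUE)[, -(1:k)]   # basis for the null space of t(BB)
max(abs(t(B) %*% V %*% A))   # zero up to rounding error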

On Thu, Jul 16, 2015 at 10:28 AM, Therneau, Terry M., Ph.D.
thern...@mayo.edu wrote:

This is as much a mathematics as an R question, in the "this should be easy
but I don't see it" category.

Assume I have a full rank p by p matrix V  (aside: V = (X'X)^{-1} for a
particular setup), a p by k matrix B, and I want to complete an orthogonal
basis for the space with distance function V.  That is, find A such that
t(B) %*% V %*% A =0, where A has p rows and p-k columns.

With V=identity this is easy. I can do it in 1-2 lines using qr(), lm(), or
several other tools.  A part of me is quite certain that the general problem
isn't more than 3 lines of R, but after a day of beating my head on the
issue I still don't see it.  Math wise it looks like a simple homework
problem in a mid level class, but I'm not currently sure that I'd pass said
class.

If someone could show the way I would be grateful.  Either that or assurance
that the problem actually IS hard and I'm not as dense as I think.

Terry T.



Re: [R] Variance estimates for survreg vs. lm

2015-07-06 Thread Therneau, Terry M., Ph.D.
The difference is that survreg is using a maximum likelihood estimate (MLE) of the 
variance and that lm is using the unbiased (MVUE) estimate of variance.  For simple linear 
regression, the former divides by n and the latter by n-p.  The difference in your 
variances is exactly n/(n-p) = 10/8.


With censored data the MLE estimate is still clear, but what exactly n is is not so 
obvious (does a censored datum count as a whole observation?), so a simple (n-p)/n 
variance correction is also not obvious.  I would not be surprised if someone, somewhere 
has hammered out a correction for unbiased variance; I've never looked for it.  I'm not 
convinced that it is worthwhile though.


Terry Therneau


On 07/04/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I would like help understanding why a survival regression with no censored
data-points does not give the same variance estimates as a linear model
(see code below).

I think it must be something to do with the fact that the variance is an
actual parameter in the survival version via the log(scale), and possibly
that different assumptions are made about the distribution of the variance.
But I really don't know, I'm just guessing.

The reason I ask is because I am moving a process, that has always been
modelled using a linear model, to a survival model (because there are
sometimes a few censored data points). In the past, the censored data
points have been treated as missing which imparts bias. The variance of the
estimates in this process is key, so I need to know why they are changing
in this systematic way?!



library(survival)

ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
ctl.surv <- Surv(ctl)

trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)

lmod <- lm(ctl ~ trt)
smod <- survreg(ctl.surv ~ trt, dist="gaussian")

coef(lmod)
coef(smod) # same

vcov(lmod)
vcov(smod) # smod is smaller

diag(vcov(lmod)) /
diag(vcov(smod))[1:2]   # 1.25 == n/(n-p), with n=10 and p=2

( summary(lmod)$coef [  , "Std. Error"] /
   summary(smod)$table[1:2, "Std. Error"] )^2   # 1.25 == n/(n-p)




[R] prmatrix and print.matrix

2015-06-25 Thread Therneau, Terry M., Ph.D.
The help page for prmatrix states that it only exists for backwards compatibility and 
strongly hints at using print.matrix instead.

 However, there does not seem to be a print.matrix() function.

The help page for print mentions a zero.print option, but that does not appear to affect 
matrices.  This (or something like it) is what I was looking for.


Am I overlooking something?

Terry Therneau



Re: [R] Different behavior of model.matrix between R 3.2 and R3.1.1

2015-06-15 Thread Therneau, Terry M., Ph.D.

Frank,
  I don't think there is any way to fix your problem except the way that I 
did it.

library(survival)
tdata <- data.frame(y=c(1,3,3,5, 5,7, 7,9, 9,13),
                    x1=factor(letters[c(1,1,1,1,1,2,2,2,2,2)]),
                    x2= c(1,2,1,2,1,2,1,2,1,2))

fit1 <- lm( y ~ x1 * strata(x2) - strata(x2), tdata)
coef(fit1)
   (Intercept)x1b x1a:strata(x2)x2=2 x1b:strata(x2)x2=2
  3.00   5.00   1.00   1.67

Your code is calling model.matrix with the same model frame and terms structure as the lm 
call above (I checked).  In your case you know that the underlying model has 2 intercepts 
(strata), one for the group with x2=1 and another for the group with x2=2, but how is the 
model.matrix routine supposed to guess that?  It can't, so model.matrix returns the proper 
result for the lm call.  As seen above the result is not singular, while for the Cox model 
it is singular due to the extra intercept.


This is simply an extension of leaving the intercept term in the model and then removing 
that column from the returned X matrix, which is necessary to have the correct coding for 
ordinary factor variables, something we've both done since day 1.  In order for 
model.matrix to do the right thing with interactions, it has to know how many intercepts 
there actually are.


I've come to the conclusion that the entire thrust of 'contrasts' in S was wrong-headed, 
i.e., the "remove redundant columns from the X matrix ahead of time" logic.  It is simply 
not possible for the model.matrix routine to guess correctly for all y and x combinations, 
something that has been acknowledged in R by changing the default for singular.ok to TRUE. 
Dealing with this after the fact via a good contrast function (a la SAS -- heresy!) would 
have been a much better design choice.  But as long as I'm in R the coxph routine tries to 
be a good citizen.


Terry T.



Re: [R] Different behavior of model.matrix between R 3.2 and R3.1.1

2015-06-11 Thread Therneau, Terry M., Ph.D.

Frank,
  I'm not sure what is going on.  The following test function works for me in both 3.1.1 
and 3.2, i.e., the second model matrix has fewer columns.  As I indicated to you earlier, 
the coxph code removes the strata() columns after creating X because I found it easier to 
correctly create the assign attribute.


  Can you create a worked example?

require(survival)
testfun <- function(formula, data) {
    tform <- terms(formula, specials="strata")
    mf <- model.frame(tform, data)

    terms2 <- terms(mf)
    strat <- untangle.specials(terms2, "strata")
    if (length(strat$terms)) terms2 <- terms2[-strat$terms]
    X <- model.matrix(terms2, mf)
    X
}

tdata <- data.frame(y= 1:10, zed = 1:10, grp = factor(c(1,1,1,2,2,2,1,1,3,3)))

testfun(y ~ zed*grp, tdata)

testfun(y ~ strata(grp)*zed, tdata)


Terry T.

- original message --

For building design matrices for Cox proportional hazards models in the
cph function in the rms package I have always used this construct:

Terms <- terms(formula, specials=c("strat", "cluster", "strata"), data=data)
specials <- attr(Terms, 'specials')
stra <- specials$strat
Terms.ns <- Terms
if(length(stra)) {
  temp <- untangle.specials(Terms.ns, "strat", 1)
  Terms.ns <- Terms.ns[-temp$terms]   # uses [.terms function
}
X <- model.matrix(Terms.ns, X)[, -1, drop=FALSE]

The Terms.ns logic removes stratification factor main effects so that
if a stratification factor interacts with a non-stratification factor,
only the interaction terms are included, not the strat. factor main
effects. [In a Cox PH model stratification goes into the nonparametric
survival curve part of the model].

Lately this logic quit working; model.matrix keeps the unneeded main
effects in the design matrix.  Does anyone know what changed in R that
could have caused this, and possibly a workaround?


---



Re: [R] Unwanted unicode - solved

2015-06-04 Thread Therneau, Terry M., Ph.D.
Thanks to all.  The suggestion below by Marc Schwarz (the second I tried) showed the 
problem.  One of the references in one of the files had been pasted in and had a funky 
dash in its 111-196 page number.  It looked just fine in my emacs window so I hadn't 
picked it up.  There are 90 .Rd files so this saved me substantial time.


Terry T.


On 06/04/2015 03:00 PM, Marc Schwartz wrote:

On Jun 4, 2015, at 12:56 PM, Therneau, Terry M., Ph.D. thern...@mayo.edu 
wrote:


I'm checking the survival package and get the following error. How do I find 
the offending line?  (There are a LOT of files in the man directory.)

Terry T.

--

* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
! Package inputenc Error: Unicode char \u8:‑ not set up for use with LaTeX.



Terry,

One possible option:

   require(tools)

   sapply(list.files(path = "Path/To/Your/RD/Files", pattern = ".Rd"),
          showNonASCIIfile)

See ?list.files and ?showNonASCIIfile

Regards,

Marc Schwartz




[R] Unwanted unicode

2015-06-04 Thread Therneau, Terry M., Ph.D.
I'm checking the survival package and get the following error. How do I find the offending 
line?  (There are a LOT of files in the man directory.)


Terry T.

--

* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
LaTeX errors found:
! Package inputenc Error: Unicode char \u8:‑ not set up for use with LaTeX.


Re: [R] Logistic regression R and Stata grouping variable

2015-05-27 Thread Therneau, Terry M., Ph.D.
You were not completely clear, but it appears that you have data where each subject has 
results from 8 trials, as a pair of variables is changed.  If that is correct, then you 
want to have a variance that corrects for the repeated measures.  In R the glm command 
handles the simple case but not the repeated measures one.  Statistically you can use a 
generalized estimating equations approach (package gee) or a random effect per subject 
approach (lme or lmer package).
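
A hedged sketch of the random-intercept version with lme4 (the variable names 
follow the post; the data are made up, with 8 rows per subject and successes out 
of n trials per row):

library(lme4)
set.seed(2)
d <- expand.grid(ID=factor(1:30), INDEP_A=0:1, INDEP_B=0:1, INDEP_C=0:1)
d$n <- 10
pr  <- plogis(-0.5 + 0.4*d$INDEP_A + 0.2*d$INDEP_B - 0.3*d$INDEP_C)
d$succ <- rbinom(nrow(d), d$n, pr)
fit <- glmer(cbind(succ, n - succ) ~ INDEP_A + INDEP_B + INDEP_C + (1 | ID),
             family=binomial, data=d)
summary(fit)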


Terry T.


On 05/27/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I mostly use Stata 13 for my regression analysis. I want to conduct a logistic 
regression on a proportion/number of success. Because I receive errors in Stata 
I did not expect nor understand (if there are Stata experts who want to know 
more about the problems I face and can potentially help me solve them, I would 
be glad to give more details), I want to repeat the analysis in R. In Stata I 
would use the command: xtlogit DEP_PROP INDEP_A INDEP_B INDEP_C, i(ID). ID is 
the identifier for each subject. There are eight lines with data for each 
subject because there are three within factors (INDEP_A, B, C) with two levels 
each (0 and 1). I can repeat this analysis in R by using the command: 
glm(DEP_SUC ~ INDEP_A + INDEP_B + INDEP_C, family = "binomial"). DEP_SUC is 
here a table with the successes and misses per row. Again, there are eight rows 
for each subject. But while I know how to group these lines in Stata by using 
the option i(ID), I do not know what to do in R. I have searched for more 
information about the i() command, but did not find any useful information.

So, to summarize: I want to find out how three variables (binary) influence a 
proportion and use logistic regression. In Stata I can group multiple lines per 
subject using the i( ) command in logistic regression. What is the equivalent 
in R?

Thank you in advance!




Re: [R] Help with pedigree() function in kinship2

2015-05-13 Thread Therneau, Terry M., Ph.D.
Your problem is that PatientID, FatherID, MotherID are factors.  The authors of kinship2 
(myself and Jason) simply never thought of someone doing this.  Yes, that is an oversight. 
 We will correct it by adding some more checks and balances.  For now, turn your id 
variables into character or numeric.
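
A sketch of the fix applied to the poster's data frame d below (the trimws, 
missid, and affected adjustments are my additions, not from the thread, since the 
factor labels are blank-padded and founders are coded "0"):

for (v in c("PatientID", "FatherID", "MotherID"))
    d[[v]] <- trimws(as.character(d[[v]]))
ped <- with(d, pedigree(PatientID, FatherID, MotherID, Sex,
                        affected=as.numeric(AffectionStatus) - 1,
                        famid=FamilyID, missid="0"))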


Terry Therneau


On 05/13/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Dear R-help,

I am interested in plotting some pedigrees and came across the kinship2
package. What follows is an example of the pedigrees I am working with.
Now, when running

## check package availability
if(!require(kinship2)) install.packages('kinship2')
require(kinship2)


## data to plot
d <- structure(list(FamilyID = c(1, 1, 1, 1, 1, 1, 1,
1, 1), PatientID = structure(c(2L, 3L, 5L, 11L, 12L, 15L,
16L, 17L, 6L), .Label = c(" 1", " 2", " 3", " 4", " 5", " 6",
" 7", " 9", "10", "11", "13", "14", "18", "20", "23", "24", "25",
"27", "28", "29", "30", "31", "33", "34", "35", "37", "38", "39",
"41", "43", "45", "50", "62", "63", "64", "65", "66", "67", "85",
"88"), class = "factor"), FatherID = structure(c(1L, 1L, 6L,
1L, 5L, 6L, 1L, 7L, 6L), .Label = c("0", "1", "10", "11", "13",
"2", "23", "27", "28", "3", "33", "34", "35", "38", "5", "62",
"64", "66", "9"), class = "factor"), MotherID = structure(c(1L,
1L, 7L, 1L, 14L, 7L, 1L, 5L, 7L), .Label = c("0", "10", "18",
"2", "24", "29", "3", "30", "33", "34", "39", "4", "43", "5",
"6", "63", "65", "9"), class = "factor"), Sex = structure(c(2L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class =
"factor"),
 AffectionStatus = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 1L,
 2L, 2L), .Label = c("1", "2"), class = "factor")), .Names =
c("FamilyID",
"PatientID", "FatherID", "MotherID", "Sex", "AffectionStatus"
), row.names = c(NA, 9L), class = "data.frame")

## plotting
ped <- with(d, pedigree(PatientID, FatherID, MotherID, Sex, affected =
AffectionStatus, famid = FamilyID))

## Error in pedigree(PatientID, FatherID, MotherID, Sex, affected =
AffectionStatus,  :
##Value of 'dadid' not found in the id list 1/0 1/0 1/2
1/0 1/2

I get an error.  My sessionInfo() is at the end.  I was wondering if
someone could help me to dissect what the cause of this error is and how it
can be fixed.

Thank you very much for your help.

Best regards,
Jorge Velez.-




Re: [R] Predict in glmnet for Cox family

2015-04-21 Thread Therneau, Terry M., Ph.D.



On 04/21/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Dear All,

I am in some difficulty with predicting 'expected time of survival' for each
observation for a glmnet cox family with LASSO.

I have two datasets, 5 * 450 (obs * var) and 8000 * 450 (obs * var); I
considered the first one as train and the second one as test.

I got the predict output and I am a bit lost here:

pre <- predict(fit, type="response", newx = selectedVar[1:20,])

  s0
1  0.9454985
2  0.6684135
3  0.5941740
4  0.5241938
5  0.5376783

This is the output I am getting - I understood that type = "response" gives
the fitted relative risk for the cox family.

I would like to know how I can convert it or change the fitted relative-risk
to 'expected time of survival' ?

Any help would be great, thanks for all your time and effort.

Sincerely,


The answer is that you cannot predict survival time, in general.  The reason is that most 
studies do not follow the subjects for a sufficiently long time.  For instance, say that 
the data set comes from a study that enrolled subjects and then followed them for up to 5 
years, at which time 35% had experienced mortality (using the usual Kaplan-Meier).  Fit a 
model to the data and ask what is the predicted survival time for a low risk subject. 
The answer will at best be greater than 5 years.   The program cannot say if it is 6 or 
10 or even 1000.  A bigger data set does not help.


Terry Therneau



Re: [R] Coxme penalized log-likelihood mismatch

2015-04-09 Thread Therneau, Terry M., Ph.D.

The perils of backwards compatibility

During computation the important quantity is  loglik + penalty.  That is what is contained 
in the third element of the loglik vector.


Originally that is also what was printed, but I later realized that for statistical 
inference one wants the loglik and penalty to be separate, e.g. the AIC is based on the 
loglik and the degrees of freedom.  So I changed the printout to give

   loglik at beta=0,  integrated loglik at solution,  ordinary loglik at 
solution
(Integrated and ordinary are the same at beta=0).

But, because I was worried that some people would have been reading the results in a 
program (like you are), I didn't want to break their code so left the internal variable 
alone.  It appears that I traded maybe safer for certain confusion.


Terry Therneau


On 04/09/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Hi,

I need to extract the penalized log-likelihood term from coxme objects, but I find 
the values stored within the object different from the penalized term given in 
the summary output of the coxme function. Both the Null and the Integrated values 
are identical, but the penalized is always off.

Any thoughts on why, and how I can extract the right value to compute the AIC 
myself? (I know an AIC value is given in the output but I need to compute it 
myself inside a loop.)

Many thanks,

Alexandre Lafontaine





Re: [R] Error in thielsen

2015-03-06 Thread Therneau, Terry M., Ph.D.
I have no idea.  A data set that generates the error would be very helpful to me.  What is 
the role of the last line BTW, the one with 1% on it?
 Looking at the code I would guess that the vector tied has an NA in it, but how that 
would happen I can't see.  There is a reasonable chance that it will become obvious once I 
have an example.


Terry Therneau


On 03/06/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I don't understand an error message from a thielsen function call within a
dplyr do() function call.

by.CaseNo <- group_by(as.data.frame(MAP.dt), CaseNo)

MAP.thielsen <- by.CaseNo %>%

+   do(model = thielsen(noninvMAP ~ invMAP, symmetric = T, data = .,
+   x = T, y = T, model = T, nboot = 1000))
|   |  1% ~40 m remaining


Error in if (any(tied)) { : missing value where TRUE/FALSE needed





Re: [R] Sorting Surv objects

2015-02-13 Thread Therneau, Terry M., Ph.D.

Your work-around does not look as easy to me.

Survival times come in multiple flavors: left censored, right censored, interval censored, 
left-truncated and right censored, and multi-state.  Can you give me guidance on how each 
of these should sort?  If a sort method is added to the package it needs to deal with all 
of these.


 Professor Ripley has pointed out the default action of sort() for right censored 
times, which I agree is reasonable.


Terry Therneau (author of the survival package)


On 02/13/2015 05:00 AM, r-help-requ...@r-project.org wrote:

It seems that Surv objects do not sort correctly.   This seems to be a bug.  
Anyone else found this?


survival.data

[1] 4+ 3  1+ 2  5+

class(survival.data)

[1] Surv

sort(survival.data)

[1] 2  1+ 4+ 3  5+

An easy work-around is to define a function sort.Surv




Re: [R] How to read this Rpart decision tree

2015-02-11 Thread Therneau, Terry M., Ph.D.

First:
   summary(ss.rpart1)
or summary(ss.rpart1, file="whatever")

The printout will be quite long since your tree is so large, so the second form may be 
best, followed by a perusal of the file with your favorite text editor.  The file name 
"whatever" above should be something you choose, of course.  This will give you a full 
description of the tree.  Read the first node or two very carefully so that you understand 
what the fit did.
  Plotting routines for trees have to make display choices, since there simply is not 
enough space available to list all the details.  You have a complicated endpoint with at 
least 14 different products.  The predicted value for the each node of the tree is a 
vector of percentages (one per product, adds to one); plots often show only the name of 
the most frequent.  The alive/dead endpoint for the Titanic data is a lot easier to fit 
into a little plotting oval so of course the plotted tree is easier to grasp.


Terry T.



On 02/11/2015 05:00 AM, r-help-requ...@r-project.org wrote:

Hi all,

In the attachment or this link (http://oi58.tinypic.com/35ic9qc.jpg) you'll find the 
decision tree I made. I used the Rpart package to make the tree and the rattle package 
using the fancyRpartPlot to plot it. The data in the tree looks different than about 
every example I have seen before. I don't understand how I should read it. I want to 
predict Product (which are productkeys). The variables to predict it contain age, 
incomegroup, gender, totalchildren, education, occupation, houseownerflag, numberCars.It 
looks like the upper number is a ProductKey. The n is number of observations? 
And the percentage of the yes/no question below.

This is the code I used.


ss.rpart1 <- rpart(Product ~ ., data=sstrain,
                   control=rpart.control(minbucket=2, minsplit=1, cp=-1))
opt <- which.min(ss.rpart1$cptable[, "xerror"])
cp  <- ss.rpart1$cptable[opt, "CP"]
ss.rpart2 <- prune(ss.rpart1, cp=cp)
fancyRpartPlot(ss.rpart2)

So why does the tree looks so different from the most (for 
example:http://media.tumblr.com/a9f482ff88b0b9cfaffca7ffd46c6a8e/tumblr_inline_mz7pyuaYJQ1s5wtly.png).
 This is from Trevor Stephen's TItanic tutorial. The first node show that 62% of 100% 
doesn't survive. If they were male, only 19% of them were survivors. I find that a lot 
examples look like that. Why does mine predict per ProductKey, and why does every node have 
something else?  It doesn't make sense to me. And it doesn't have the two numbers like .62 
and .38, but it has n=197e+3. So should I read the first node like "For 100% of the 
observations of ProductKey 1074, the incomegroup was moderate"?

Thank you!

Kim




Re: [R] R example codes for direct standardization of rates

2015-01-07 Thread Therneau, Terry M., Ph.D.
The pyears() and survexp() routines in the survival package are designed for these 
calculations.  See technical report #63 of the Mayo Biostat group for examples:

http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biomedical-statistics-informatics/technical-reports

Terry Therneau

-- begin included message ---

I am looking for R  example codes to compute age-standardized death rates by 
smoking and 
psychological distress status using person-years of observation created from 
the National 
Health Interview Survey Linked Mortality Files.  Any help with the example 
codes or 
references will be appreciated.




Re: [R] half-logistic distribution

2014-12-29 Thread Therneau, Terry M., Ph.D.



On 12/26/2014 05:00 AM, r-help-requ...@r-project.org wrote:

I want to analyse survival data using the type I half-logistic
distribution.  How can I go about it?  The survival package installed
in R didn't include the distribution... or I need code to
use maximum likelihood to estimate the parameter in survival
analysis.  A typical example of a distribution other than those installed
in R will help.  Thanks.



I am the author of the survival package, and had never heard of the type I half-logistic 
before.
New distributions can be added to survreg() if they can be represented as location-scale 
families; I don't think that your distribution can be written in that way.


Look at "survival" under the Task Views tab (upper left corner) of the cran.r-project.org 
web page 
-- someone else may have already done it.


Terry T.



Re: [R] Cox model with multiple events - PH assumption

2014-12-23 Thread Therneau, Terry M., Ph.D.



On 12/23/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Dear all,

I'm using the package survival for adjusting the Cox model with multiple
events (Prentice, Williams and Peterson Model). I have several covariates,
some of them are time-dependent.

I'm using the function cox.zph to check the proportional hazards. Due to
the nature of the time-dependent covariates, I don't need to analyse the
assumptions of the proportional hazards associated with the
time-dependent covariates. Is that right?

Thanks for your attention.

Best regards,
Helena.


Wrong.
The PH assumption is that the same coefficient b for a covariate applies over all 
follow-up time.  The fact that the covariate itself changes value, or does not change 
value, over time has no bearing on whether the assumption is true.


Now it may often be the case that risk is related to current covariate values, and a Cox 
model using baseline values fails just because the covariates are out of date.  So PH 
might hold for the updated (time-dependent) covariates and fail when using baseline 
values.  This is a very study specific situation, however.


Terry T.



Re: [R] exclude missing co-variable data in cox model

2014-12-19 Thread Therneau, Terry M., Ph.D.

Three responses to your question
  1. Missing values in R are denoted by NA.  When reading in your data you want to use 
the na.strings option so that the internal form of the data has missing values properly 
denoted.


  2. If this is done, then coxme will notice the missings and remove them, you do not 
need to do anything.  Your second question of how to use the missing data is a much 
deeper statistical one.  Multiple imputation would be the obvious way to proceed, but it 
is complex and I have no experience with respect to how it would work with a mixed effects 
Cox model.


  3. Your note implies that output was attached.  Note that r-help strips away all 
attachments, so none of us saw it.


Terry Therneau


On 12/19/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Hi all,

I have a data set like this:

Test.cox file:

V1   V2  V3       Survival  Event
ann  13  WTHomo   4         1
ben  20  *        5         1
tom  40  Variant  6         1

where * indicates that I don't know what the value is for V3 for Ben.

 Remainder of message excluded




Re: [R] How to pass a nested random effect with frailty in coxph()

2014-12-12 Thread Therneau, Terry M., Ph.D.

Use the coxme function (package coxme), which has the same syntax as lme4.
The frailty() function in coxph only handles the simple case of a random 
intercept.
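
A hedged sketch of the call (toy data of my own making; '(1 | treatment/id)' is 
the coxme spelling of individual nested within treatment, and on noise like this 
the variance estimates may sit near zero):

library(survival)
library(coxme)
set.seed(3)
d <- data.frame(treatment=factor(rep(c("A","B"), each=50)),
                id=factor(rep(1:50, each=2)),   # ids 1-25 in A, 26-50 in B
                x=rnorm(100),
                time=rexp(100), status=rbinom(100, 1, .7))
fit <- coxme(Surv(time, status) ~ x + (1 | treatment/id), data=d)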

Terry Therneau


On 12/12/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Hi,
I have a very simple Cox regression model in which I need to include a
nested random effect: individual nested in treatment. I know how to pass a
single random effect - e.g. frailty(id)- but how can I specify the nested
random (id nested in treatment) effect using frailty?
The equivalent in lme4 would be (treatment|id)

Thanks!

David




[R] CMD check error

2014-11-18 Thread Therneau, Terry M., Ph.D.
I have a new package (local use only).  R CMD check fails with a message I 
haven't seen before, and I haven't been able to guess the cause.
There are two vignettes, both of which have %\VignetteIndexEntry lines.

Same failure both under R-3.1.1 and R-devel, so it's me and not R.  Linux OS.

Hints anyone?

Terry Therneau

=

tmt% R CMD build dart
* preparing 'dart':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* looking to see if a 'data/datalist' file should be added
* building 'dart_1.0-2.tar.gz'

tmt% R CMD check dart*gz
...
Installation failed.
See '/people/biostat2/therneau/consult/bsi/dart.Rcheck ...

tmt% more dart.Rcheck/00install.out
...
** installing vignettes
Warning in file(con, w) :
   cannot open file '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart/doc/
index.html': No such file or directory
Error in file(con, w) : cannot open the connection
ERROR: installing vignettes failed
* removing '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart'




Re: [R] CMD check error

2014-11-18 Thread Therneau, Terry M., Ph.D.

[Picture of a hand slapping forehead]
I suspected it to be something either dumb or obvious, and managed both.

This library plugs into the tail of a long development project for a web API, which had 
prior documentation, which I stuck in doc.  And then added to .Rbuildignore.


Thanks Hadley

Terry T.


On 11/18/2014 08:47 AM, Hadley Wickham wrote:

Do you have a .Rbuildignore? If so, what's in it?
Hadley

On Tue, Nov 18, 2014 at 7:07 AM, Therneau, Terry M., Ph.D.
thern...@mayo.edu wrote:

I have a new package (local use only).  R CMD check fails with a message I 
haven't seen before, and I haven't been able to guess the cause.
There are two vignettes, both of which have %\VignetteIndexEntry lines.

Same failure both under R-3.1.1 and R-devel, so it's me and not R.  Linux OS.

Hints anyone?

Terry Therneau

=

tmt% R CMD build dart
* preparing 'dart':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* looking to see if a 'data/datalist' file should be added
* building 'dart_1.0-2.tar.gz'

tmt% R CMD check dart*gz
...
Installation failed.
See '/people/biostat2/therneau/consult/bsi/dart.Rcheck ...

tmt% more dart.Rcheck/00install.out
...
** installing vignettes
Warning in file(con, w) :
cannot open file 
'/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart/doc/
index.html': No such file or directory
Error in file(con, w) : cannot open the connection
ERROR: installing vignettes failed
* removing '/people/biostat2/therneau/consult/bsi/dart.Rcheck/dart'




Re: [R] cluster + tt terms in coxph

2014-11-05 Thread Therneau, Terry M., Ph.D.
This is fixed in version 2.37-8 of the survival package, which has been in my "send to 
CRAN real-soon-now" queue for 6 months.  Your note is a prod to get it done.  I've been 
updating and adding vignettes.


Terry Therneau


On 11/05/2014 05:00 AM, r-help-requ...@r-project.org wrote:

I am receiving the following error when trying to include both tt (time 
transforms) and frailty terms in coxph


coxph(Surv(time, status) ~ ph.ecog + tt(age) + cluster(sex), data=lung,
+      tt=function(x,t,...) pspline(x + t/365.25))
Error in residuals.coxph(fit2, type = "dfbeta", collapse = cluster, weighted = TRUE) :
   Wrong length for 'collapse'

I tried both 64 bit (R.3.1.0) and 32 bit (R.3.1.2) in Windows 7 64bit and get 
the same errors

Inclusion of tt and cluster terms worked fine in R2.9.2-2.15.1 under Windows 
Vista 32 bit and Ubuntu 64 bit

Any ideas?




[R] merge by time, certain value if 5 min before and after an event

2014-10-03 Thread Therneau, Terry M., Ph.D.
I've attached two functions used locally.  (The attachments will be stripped off of the 
r-help response, but the questioner should get them).  The functions neardate and 
tmerge were written to deal with a query that comes up very often in our medical 
statistics work, some variety of "get the closest creatinine value to the subject's date 
of rehospitalization, at least one week before but no more than 1 year prior", or tasks 
that merge two data sets to create a single (start, stop] style one.


The neardate function is a variant on match().  Given two (id, date) pairs it will find 
the first pair in list 2 that has date2 >= date1 (or <=) and the same id.  The second 
variable can be any orderable class, but dates are the most common use and hence the name.


These are being added to the survival package release that should be out real-soon-now, 
once I add some extended examples of their use to the time dependent covariates vignette.


Terry Therneau

On 10/03/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Hello! I hope someone can help me. It would save me days of work. Thanks in
advance!
I have two dataframes which look like these:


myframe <- data.frame(Timestamp=c("24.09.2012 09:00:00", "24.09.2012 10:00:00",
                                  "24.09.2012 11:00:00"),
                      Event=c("low","high","low"))
myframe


mydata <- data.frame(Timestamp=c("24.09.2012 09:05:01", "24.09.2012 09:49:50",
                                 "24.09.2012 09:51:01", "24.09.2012 10:04:50",
                                 "24.09.2012 10:05:10"),
                     location=c(1,2,3,1,5))
mydata


# I want to merge them by time so I have a dataframe which looks like this
# in the end (i.e. "low" during 5 min before and after "high"):

result <- data.frame(Timestamp=c("24.09.2012 09:05:01", "24.09.2012 09:49:50",
                                 "24.09.2012 09:51:01", "24.09.2012 10:04:50",
                                 "24.09.2012 10:05:10"),
                     location=c(1,2,3,1,5),
                     Event=c("low", "low", "high", "high", "low"))
result

Anyone knows how do merge them?
Best regards,
Dagmar

#
# Create a nearest date index
#  date1: the trial date
#  date2: target to match to
#
# result: an index vector for data set 1, which shows the row in data set
#  2 that has the same id, and the best date.
#
# best = "after"  The closest date in #2 that is on or after the date in #1
#        "prior"  The closest date in #2 that is on or before the date in #1
# 
neardate <- function(date1, date2, id1, id2, best=c("after", "prior")) {
    if (!missing(id1)) {
        if (length(id1) != length(date1))
            stop("id1 and date1 have different lengths")
        if (missing(id2))
            stop("either both or neither of id1 and id2 must be supplied")
        if (!identical(class(id1), class(id2)) && !(is.numeric(id1) && is.numeric(id2)))
            stop("id1 and id2 are different data types")
        if (length(id2) != length(date2))
            stop("id2 and date2 have different lengths")
    }
    else if (!missing(id2))
        stop("either both or neither of id1 and id2 must be supplied")

    if (!identical(class(date1), class(date2)) && !(is.numeric(date1) && is.numeric(date2)))
        stop("date1 and date2 are different data types")

    if (is.factor(date1) && !is.ordered(date1)) 
        stop("date1 cannot be a factor")

    date1 <- as.numeric(date1)  # this will fail for silly inputs like a list
    date2 <- as.numeric(date2)  
    best <- match.arg(best)

    # Throw out missing dates in the second arg, but remember which ones
    rowid <- 1:length(date2)
    if (any(is.na(date2))) {
        toss <- is.na(date2)
        date2 <- date2[!toss]
        if (!missing(id2)) id2 <- id2[!toss]
        rowid <- rowid[!toss]
    }
    n2 <- length(date2)
    if (n2 == 0) stop("No valid entries in data set 2")

    # Simpler case: no id variables
    if (missing(id1)) {
        if (best=="prior")
            indx2 <- approx(date2, 1:n2, date1, method="constant", yleft=NA,
                            yright=n2, rule=2, f=0)$y
        else 
            indx2 <- approx(date2, 1:n2, date1, method="constant", yleft=1,
                            yright=NA, rule=2, f=1)$y
        return(rowid[indx2])
    }

    # match id values as well
    #   First toss out any rows in id2 that are not possible targets for id1
    #   (id2 is usually the larger data set, thinning speeds it up)
    indx1 <- match(id2, id1)
    toss <- is.na(indx1) 
    if (any(toss)) {
        id2 <- id2[!toss]
        date2 <- date2[!toss]
        indx1 <- indx1[!toss]
        rowid <- rowid[!toss]
    }

    n2 <- length(date2)
    if (n2 == 0) stop("No valid entries in data set 2")

    # We need to create a merging id.  A minimal amount of
    #  spread for the dates keeps numeric overflow at bay
    alldate <- sort(unique(c(date1, date2)))
    date1 <- match(date1, alldate)
    date2 <- match(date2, alldate)
    delta <- 1.0 + length(alldate)  # numeric, not integer, on purpose
    hash1 <- match(id1, id1)*delta + date1
    hash2 <- indx1*delta + date2 

    if (best=="prior")
        indx2 <- approx(hash2, 1:n2, hash1, method="constant", yleft=NA,
                        yright=n2, rule=2, f=0)$y
    else
        # (the archive cut the message off here; the remainder is
        #  reconstructed by symmetry with the no-id branch above)
        indx2 <- approx(hash2, 1:n2, hash1, method="constant", yleft=1,
                        yright=NA, rule=2, f=1)$y
    rowid[indx2]
}
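
For the archive, a minimal sketch of the no-id case applied to Dagmar's two data frames 
above; the timestamp format string is an assumption, and the "5 min before and after" 
rule would still need to be applied to the gap column afterwards:

fmt  <- "%d.%m.%Y %H:%M:%S"
t1   <- as.POSIXct(as.character(mydata$Timestamp),  format = fmt)
t2   <- as.POSIXct(as.character(myframe$Timestamp), format = fmt)
indx <- neardate(t1, t2, best = "prior")  # row of the last event at or before each reading
gap  <- abs(difftime(t1, t2[indx], units = "mins"))
result <- data.frame(mydata, Event = myframe$Event[indx], gap = gap)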

[R] c() and dates

2014-10-03 Thread Therneau, Terry M., Ph.D.

I'm a bit puzzled by a certain behavior with dates.  (R version 3.1.1)

> temp1 <- as.Date(1:2, origin="2000/5/3")
> temp1
[1] "2000-05-04" "2000-05-05"

> temp2 <- as.POSIXct(temp1)
> temp2
[1] "2000-05-03 19:00:00 CDT" "2000-05-04 19:00:00 CDT"

So far so good.  On 5/4, midnight in Greenwich it was 19:00 on 5/3 in my time zone.  The 
manual page has a clear explanation of what goes on.


> c(temp1, temp2)
[1] "2000-05-04"    "2000-05-05"    "2623237-10-15" "2623474-05-06"
> class(c(temp1, temp2))
[1] "Date"

> c(temp2, temp1)
[1] "2000-05-03 19:00:00 CDT" "2000-05-04 19:00:00 CDT"
[3] "1969-12-31 21:04:41 CST" "1969-12-31 21:04:42 CST"
> class(c(temp2, temp1))
[1] "POSIXct" "POSIXt"

I would have expected c() to determine a common class, somehow, then do the conversion and 
concatenate.  That is obviously not what happens.  I've read the manual page but I must be 
missing something.  I make no claim that R is broken, mistaken, or otherwise deficient, 
only that my understanding is so.


Could someone illuminate?

Terry T.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] c() and dates

2014-10-03 Thread Therneau, Terry M., Ph.D.
Well duh -- type "c.Date" at the command prompt to see what is going on.  I suspected I 
was being dense.


Now that the behavior is clear, can I follow up on David W's comment that redefining the 
c.Date function as


structure(c(unlist(lapply(list(...), as.Date))), class = "Date")

allows for a more intelligent response, since it allows all of the as.Date machinery to be 
brought into play.


It seems like a good idea in general.  Would it be a good exchange between the current 
nonsense result (with no warning) and the new error messages that would arise, e.g., from 
c(as.Date("2000/10/1"), factor('b'))?
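
A sketch of that redefinition in action, using temp1 and temp2 from the first message 
(as.Date on a POSIXct puts both halves onto the Date scale):

c.Date <- function(...)
    structure(c(unlist(lapply(list(...), as.Date))), class = "Date")
c(temp1, temp2)   # all four elements now print as dates in May 2000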




Terry T.

On 10/03/2014 09:52 AM, peter dalgaard wrote:

S3 only has single dispatch, so in one case it dispatches to c.Date and in the 
other to c.POSIXct, both of those return an object of the corresponding class. 
In both cases, the arguments pass through

c(unlist(lapply(list(...), unclass)))

which doesn't look at the class at all. Since Date objects unclass to days and 
POSIXct to seconds, something is bound to go wrong.
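
For concreteness, a two-line sketch of the two unclassed scales (the exact second count 
depends on the machine's time zone):

unclass(as.Date("2000-05-04"))     # 11081: days since 1970-01-01
unclass(as.POSIXct("2000-05-04"))  # about 957 million: seconds since 1970-01-01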


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] CoxME: Family relatedness

2014-09-15 Thread Therneau, Terry M., Ph.D.

I would have caught this tomorrow (I read the digest).
Some thoughts:

1. Skip the entire step of subsetting the death.kmat object.  The coxme function knows how 
to do this on its own, and is more likely to get it correct.  My version of your code would be

  deathdat.kmat <- 2* with(deathdat, makekinship(famid, id, faid, moid))
  model3 <- coxme(Surv(Survival, Event) ~ Sex + strata(cohort) + SNP1 + SNP2 + 
SNP3 + (1|id),
data=death.dat, varlist=deathdat.kmat)


This all assumes that the id variable is unique.  If family 3 and family 4 both have an 
id of 1, then the coxme call can't match up rows in the data to rows/cols in the kinship 
matrix uniquely.  But that is simple to fix.
The kinship matrix K has .5 on the diagonal, by definition, but when used as a correlation 
most folks prefer to use 2K.  (This causes mixups since some software adds the 2 for 
you, but coxme does not.)
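
A one-line sanity check of that scaling, as a sketch against the matrix built in point 1:

diag(deathdat.kmat)[1:4]   # should be 1's (for a non-inbred pedigree) after the factor of 2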


2. The model above is the correct covariance structure for a set of families.  There is a 
single intercept per subject, with a complex correlation matrix.  The simpler per family 
frailty model would be


model4 <- coxme(Surv(Survival, Event) ~ Sex + strata(cohort) + SNP1 + SNP2 + SNP3 + 
(1|famid), death.dat)


This model lets each family have a separate risk, with everyone in the same family sharing 
the exact same risk.  It is less general than model3 above which lets a family have higher 
risk plus has variation between family members.


A model with both per-subject and per family terms is identical to one with a covariance 
matrix of s1 K + s2 B, where K is the kinship matrix, B is a block diagonal matrix which 
has a solid block of 1 for each family, and s1 s2 are the fitted variance coefficients.


  I don't find this intuitive, but have seen the argument that B is a shared 
environmental effect.  (Perhaps because I have large family trees where they do not all 
live together).  If you want such a model:

   model5 <- coxme(.. + (1|id) + (1|famid), death.dat, 
varlist=deathdat.kmat)

(When the varlist is too short the program uses the default for remaining 
terms).

3. To go further you will need to tell us what you are trying to model, as math formulas 
not as R code.


4. The error messages you got would more properly read "I'm confused" on the part of the 
program.  They are cases of something I would never do, so I never got those messages 
myself -- therefore useful for me to see.  Continuous variables go to the left of the | 
and categoricals to the right of the |.  Having a family id to the left makes no sense at all.


Terry Therneau


On 09/15/2014 03:20 PM, Marie Dogherty wrote:

Dr. Therneau,

I was wondering if you had a spare minute, if you could view a post in the R 
forum:

http://r.789695.n4.nabble.com/CoxME-family-relatedness-td4696976.html

I would appreciate it, I'm stuck and out of ideas!

Many thanks

Marie.


---Original post --

Hello all,

I have a table like this, with ~300 individuals:

Famid Id Faid Moid Cohort Sex Survival Event SNP1 SNP2 SNP3

    1  1    0    0      0   0       10     1    0    1    0
    2  2    0    0      1   1       20     1    0    0    0
    2  3    0    0      0   0       25     1    0    1    0
    4  5    1    2      0   0       35     1    0    1    1
    4  6    1    2      0   0       35     0    1    0    1

famid=family id, id=id of person, faid=father id, moid=mother id.

My question of interest: What impact does SNP1/SNP2/SNP3 have on survival of individuals 
(Id), after accounting for possible effects due to family relatedness (Famid).


So I want to account for family-specific frailty effects, and individual frailty effects 
according to degree of relationship between individuals.


The commands I've used are from this paper: 
http://www.ncbi.nlm.nih.gov/pubmed/21786277

library(survival)
library(coxme)
library(kinship2)
library(bdsmatrix)

death.dat <- read.table("Table", header=TRUE)

# Make a kinship matrix for the whole study
deathdat.kmat <-
    makekinship(famid=death.dat$famid, id=death.dat$id, father=death.dat$faid,
                mother=death.dat$moid)


## omit linker individuals with no phenotypic data, used only to properly
## specify the pedigree when calculating kinship coefficients:

death.dat1 <- subset(death.dat, !is.na(Survival))

## temp is an array with the indices of the individuals with Survival years:
all  <- dimnames(deathdat.kmat)[[1]]
temp <- which(!is.na(death.dat$Survival[match(all, death.dat$id)]))

## kinship matrix for the subset with phenotype Survival:
deathdat1.kmat <- deathdat.kmat[temp, temp]


If I type:

model3 <- coxme(Surv(Survival,Event) ~ Sex + strata(Cohort) + SNP1 + SNP2 + SNP3 +
    (1+id|famid), data=death.dat, varlist=deathdat1.kmat)


I get:

"Error in coxme(Surv(Survival, Event) ~ Sex + strata(Cohort) + SNP1 +  :
  In random term 1: Mlist cannot have both covariates and grouping"


Whereas if I type:

model3 <- coxme(Surv(Survival,Event) ~ Sex + strata(Cohort) + SNP1 + SNP2 + SNP3,
    (id|famid), data=death.dat1, varlist=deathdat1.kmat)



I get:

Error in coxme(Surv(Survival, Event) ~ Sex + 

Re: [R] Linear relative rate / excess relative risk models

2014-08-28 Thread Therneau, Terry M., Ph.D.



On 07/30/2014 05:00 AM, r-help-requ...@r-project.org wrote:

A while ago, I inquired about fitting excess relative risk models in R. This is 
a follow-up about what I ended up doing in case the question pops up again.

While I was not successful in using standard tools, switching to Bayesian 
modeling using rstan (mc-stan.org/rstan.html) worked better. The results 
closely match those from Epicure.

Using the data here: http://dwoll.de/err/dat.txt
The stan model fit below replicates the results from Epicure 
here: http://dwoll.de/err/epicure.log

Of course I am still interested in learning about other options or approaches.

Daniel



A good paper on issues and solutions is Lumley, Kronmal and Ma (2006),
"Relative Risk Regression in Medical Research: Models, Contrasts, Estimators, 
and Algorithms".

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plot for survreg and intcox (rewritten)

2014-08-27 Thread Therneau, Terry M., Ph.D.

I missed this question.

1. For survreg.
  help(predict.survreg) shows an example of drawing a survival curve

Adding a survfit method has been on my list for a long time, since it would make this 
information easier to find.



2. intcox.  I had not been familiar with this function.  Even though the authors have 
decided to label the result with a coxph class, presumably so the coxph print function 
will work, almost none of the coxph methods will work on the result.
  However, reading their manual page it seems that the hazard and its time points are 
returned as a part of the result.  I tried running an example to see if the result is a 
hazard or a cumulative hazard but the run failed and it's time to run to a meeting. 
Assume it is cumulative hazard.  Then exp(-fit$lambda0) will give the baseline survival curve.
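
A sketch of that last step; the time-point component name is a guess on my part, so 
check names(cox.fit) against the intcox manual page first:

plot(cox.fit$time.point, exp(-cox.fit$lambda0), type = "s",
     xlab = "Time", ylab = "Baseline survival")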


Terry T


On 08/27/2014 02:47 PM, David Winsemius wrote:

On Aug 26, 2014, at 3:34 PM, Silong Liao wrote:


Hello,

My data is interval censored, and I use the intcox package, which deals with this 
situation.

I guess survest is only for survfit, and I would like to plot the parametric model 
from survreg.  Since I use dist="weibull", I tried the curve function but the plot 
seems unreasonable.


mod.reg1=survreg(s_new~type+sex+eye+age+preopiop+preopva,dist="weibull")
intercept=-19.155
scale=8.12
curve(pweibull(x,scale=exp(coef(mod.reg1)),shape=1/mod.reg1$scale,lower.tail=FALSE),from=0,to=40)
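
One likely culprit for the unreasonable plot, as a hedged sketch: coef(mod.reg1) is the 
whole coefficient vector, so exp(coef(mod.reg1)) hands pweibull several scales at once.  
Using the linear predictor for a single covariate pattern instead (the values in newd 
are invented):

newd <- data.frame(type="Trab", sex="M", eye="R", age=60, preopiop=20, preopva=0.5)
lp   <- predict(mod.reg1, newdata=newd, type="lp")   # log(scale) for this pattern
curve(pweibull(x, scale=exp(lp), shape=1/mod.reg1$scale, lower.tail=FALSE),
      from=0, to=40)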



You are asked to reply to the list. ( I am taking the liberty of forwarding to 
Terry Therneau since I don't want to misquote him.)

I remember Terry Therneau saying that plotting survival curves from 
interval-censored data seemed difficult at best and nonsensical at worst. (If 
it's difficult for him it's likely to be impossible for me.) I don't know if 
the assumption of a parametric form might aid in that effort.

I also remember warnings (in both the documentation and repeated many times by 
Therneau on rhelp) that the parameters of the weibull distribution used in the 
survival package were different than those used in the stats package weibull 
function.

-- David.



Subject: Re: [R] plot for survreg and intcox (rewritten)
From:dwinsem...@comcast.net
Date: Tue, 26 Aug 2014 14:57:16 -0700
CC:r-help@r-project.org
To:silong.l...@hotmail.com


On Aug 26, 2014, at 2:33 PM, Silong Liao wrote:


Dear R users,

I'm trying to plot survival probability against time(in years) using survreg and 
intcox. Please can you help me with this problem? (I have rewritten using plain text.) I tried to use 
curve function but have no clue.



I suspect you want survfit (in the survival package, which is where I suspect 
survreg is coming from).  It returns an object that has a plot method.  You could also 
scroll through help(package="survival") to see other plotting functions.

You could also use survest in the rms package.


For survreg,

mod.reg1=survreg(s_new~type+sex+eye+preopiop+preopva,dist="weibull")
summary(mod.reg1)

Call:
survreg(formula = s_new ~ type + sex + eye + preopiop + preopva,
    dist = "weibull")
             Value Std. Error      z        p
(Intercept) 40.539     20.582  1.970 4.89e-02
typeTrab    -6.606      4.279 -1.544 1.23e-01
sexM        -1.055      3.765 -0.280 7.79e-01
eyeR        -2.112      3.587 -0.589 5.56e-01
preopiop    -0.308      0.269 -1.147 2.52e-01
preopva     -0.461      1.771 -0.260 7.95e-01
Log(scale)   2.058      0.285  7.222 5.12e-13

Scale= 7.83
Weibull distribution
Loglik(model)= -78.7 Loglik(intercept only)= -81.4
Chisq= 5.37 on 5 degrees of freedom, p= 0.37
Number of Newton-Raphson Iterations: 10
n= 339

For intcox,


You are asked to provide the package name for functions that are not in the 
base or default packages. I have quite a few packages loaded including 
survival_2.37-7 , coxme_2.2-3, and rms_4.2-0 but I get:


?intcox

No documentation for ‘intcox’ in specified packages and libraries:
you could try ‘??intcox’




--
David.




cox.fit=intcox(s_new~type+eye+sex+age+preopiop+preopva,data=glaucoma_new)
summary(cox.fit)

Call:
intcox(formula = s_new ~ type + eye + sex + age + preopiop +preopva, data = 
glaucoma_new)

n= 339

             coef exp(coef) se(coef)  z Pr(>|z|)
typeTrab  0.59391   1.81106       NA NA       NA
eyeR      0.28419   1.32868       NA NA       NA
sexM     -0.11597   0.89050       NA NA       NA
age      -0.06556   0.93655       NA NA       NA
preopiop  0.03903   1.03980       NA NA       NA
preopva  -0.05517   0.94632       NA NA       NA

         exp(coef) exp(-coef) lower .95 upper .95
typeTrab    1.8111     0.5522        NA        NA
eyeR        1.3287     0.7526        NA        NA
sexM        0.8905     1.1230        NA        NA
age         0.9365     1.0678        NA        NA
preopiop    1.0398     0.9617        NA        NA
preopva     0.9463     1.0567        NA        NA

Rsquare= NA (max possible= 0.327 )
Likelihood ratio test= NA on 6 df, p=NA
Wald test = NA on 6 df, p=NA
Score (logrank) test = NA on 6 df, p=NA

__
R-help@r-project.org  mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guidehttp://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius
Alameda, CA, USA




Re: [R] Cox regression model for matched data with replacement

2014-08-13 Thread Therneau, Terry M., Ph.D.



On 08/13/2014 05:00 AM, John Purda wrote:

I am curious about this problem as well. How do you go about creating the 
weights for each pair, and are you suggesting that we can just incorporate a 
weight statement in the model as opposed to the strata statement? And Dr. 
Therneau, let's say I have 140 cases matched with replacement to 2 controls. Is 
my id variable the number of cases?



 The above has an incorrect assumption -- that I notice ALL survival questions on the 
list -- which was false in this case.  Could you clue me in as to the original question and 
discussion -- assuming that you want Dr Therneau to respond intelligently :-)


Terry T.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cox regression model for matched data with replacement

2014-08-13 Thread Therneau, Terry M., Ph.D.

Ok, I will try to do a short tutorial answer.

1. The score statistic for a Cox model is a sum of (x - xbar), where x is the covariate 
vector of the subject who had an event, and xbar is the mean covariate vector for the 
population, at that event time.

  - the usual Cox model uses the mean of {everyone still at risk} as xbar
  - matched Cox models use a mean of {some subset of those at risk}, and work fine as 
long as that subset is an honest estimate of xbar.  You do, of course, have to sample from 
those still at risk at the time point, since that is the xbar you are trying to estimate. 
 Someone who dies or is censored at time 10 can't be a control at time 20.
  - in an ordinary Cox model the program figures out who belongs in each xbar average all 
on its own, using the time variable.  In a matched model you need to supply the "who 
dances with who" information.  The usual way is to assign each of the sets {subject who 
died + their controls} to a separate stratum.  (If there is only one death in each stratum 
then the time variable will not be needed and you can plug in a dummy value; this is what 
clogit does.)  You can have more than one control per case by the way.


2. Variance.  In the matched model you run the risk, a quite small risk, that the same 
person would be picked again and again as the control.  If this unfortunate thing were to 
happen then the usual model-based variance would be too optimistic --- because of its 
overdependence on one single subject the fit is more unstable than it looks.  Three 
solutions: a) don't worry about it (my usual approach),  b) when selecting controls, 
ensure that this doesn't happen (classic matched case-control),  c) use a robust variance. 
 For the latter make sure that each subject in the data set has a unique value for some 
variable "id" and add "+ cluster(id)" to the model statement (a sketch follows point 3).


3. The most common mistake in matching is to exclude, at a given death time t, any subject 
with a future event from the list of potential controls at time t.  This does not lead to 
an unbiased estimate of xbar, and the resulting numerical bias in the coefficients is 
shockingly large.
  There are more clever ways to pick the subset at each event time, e.g., if you had some 
prior information on all the subjects that can classify them into high/medium/low risk. 
Survey sampling principles come into play for selection and the xbar at each time is 
replaced with an appropriate weighted survey estimate.  See various papers by Brian Langholz.
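
Pulling the strata of point 1 and the robust variance of point 2 together, a sketch 
(the data frame mdata and the variables set and id are hypothetical; set labels each 
case together with its sampled controls):

fit <- coxph(Surv(time, status) ~ x + strata(set) + cluster(id), data = mdata)
summary(fit)   # the robust se appears next to the model-based one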


Terry T


On 08/13/2014 07:26 AM, John Pura wrote:

Hi Dr. Therneau,

The original question on the forum was:

My problem was how to build a Cox model for the matched data (1:n) with
replacement. Usually, we can use stratified Cox regression model when the
data were matched without replacement. However, if the data were matched
with replacement, due to the re-use of subjects, we should give a weight
for each pair, then how to incorporate this weight into a Cox model. I also
checked the clogit function in survival package, it seems suitable to the
logistic model for the matched data with replacement, rather than Cox
model. Because it sets the time to a constant. Anyone can give me some
suggestions?

I’m facing a very similar situation, in which I have multiple controls to 
multiple cases.
How would I go about taking that dependency into account in a Cox model? Is 
this weighting
appropriate and to get robust sandwich estimates, can I take my id variable to 
be the id
for the unique cases?

Thanks,

John



On 08/13/2014 05:00 AM, John Purda wrote:

I am curious about this problem as well. How do you go about creating the 
weights for each pair, and are you suggesting that we can just incorporate a 
weight statement in the model as opposed to the strata statement? And Dr. 
Therneau, let's say I have 140 cases matched with replacement to 2 controls. Is 
my id variable the number of cases?

  The above has an incorrect assumption -- that I notice ALL survival questions 
on the list -- which was false in this case.  Could you clue me in as to the 
original question and discussion -- assuming that you want Dr Therneau to 
respond intelligently :-)

Terry T.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Cox regression model for matched data with replacement

2014-08-13 Thread Therneau, Terry M., Ph.D.



On 08/13/2014 08:38 AM, John Pura wrote:

Thank you for the reply. However, I think I may not have clarified what my 
cases are. I'm studying the effect of radiation treatment (vs. none) on 
survival. My cases are patients who received radiation and controls are those 
who did not. I used a propensity score model to match cases to controls in a 
1:2 fashion. However, because the matching was done with replacement, some 
controls were matched to more than one case. How can I go about analyzing this 
- would frequency weighting work?

Thanks,
John


We went down the wrong path.  When people use the word "case" it almost always refers to 
subjects who had the outcome.  If I read the above correctly you have the simpler 
situation of subset selection.  Subjects were chosen to be in the model without reference 
to their outcome status, with the goal of balancing treatment wrt other predictive 
factors.  Correct?   If so, my preferred modeling strategy, in order:


1. coxph(Surv(time, status) ~ treatment, data=one)
  Where data set one has one copy of each subject selected to be in the study.  If they 
were nominated twice they still appear once.  Optional: give each control a case weight 
equal to the number of times they were selected.  This will better balance the data set 
wrt the factors.


2. Same model, with covariates.  The argument about whether covariates on which you have 
balanced should be included in the model is as old as the hills --- belt AND suspenders? 
--- with proponents on both sides.  Meh.  Unless there are too many of course. I still 
like the 10-20 events per covariate rule to choose the maximum number of predictors.


3. coxph(Surv(time, status) ~ treatment + strata(group), data=two)
 I view this as model 2 with paranoia.  The covariate effects are so odd that we'll 
never model them correctly, so treat each combination as unique.   The data set two needs 
to have each treated subject + their controls in a separate stratum.  Even though some 
controls are in the data set twice, they don't need case weights since they are in any 
given stratum only once.


For any  of the above you can add a robust variance.  Required if case weights 
are used.
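
As a sketch of model 1 with the optional case weights plus the robust variance they 
require ("nsel", the number of times each control was selected, is a hypothetical 
variable):

fit1 <- coxph(Surv(time, status) ~ treatment + cluster(id), data = one,
              weights = nsel)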

Terry T

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Survival Analysis with an Historical Control

2014-07-10 Thread Therneau, Terry M., Ph.D.

You are asking for a one sample test.  Using your own data:

connection <- textConnection("
GD2  1   8 12  GD2  3 -12 10  GD2  6 -52  7
GD2  7  28 10  GD2  8  44  6  GD2 10  14  8
GD2 12   3  8  GD2 14 -52  9  GD2 15  35 11
GD2 18   6 13  GD2 20  12  7  GD2 23  -7 13
GD2 24 -52  9  GD2 26 -52 12  GD2 28  36 13
GD2 31 -52  8  GD2 33   9 10  GD2 34 -11 16
GD2 36 -52  6  GD2 39  15 14  GD2 40  13 13
GD2 42  21 13  GD2 44 -24 16  GD2 46 -52 13
GD2 48  28  9  GD2  2  15  9  GD2  4 -44 10
GD2  5  -2 12  GD2  9   8  7  GD2 11  12  7
GD2 13 -52  7  GD2 16  21  7  GD2 17  19 11
GD2 19   6 16  GD2 21  10 16  GD2 22 -15  6
GD2 25   4 15  GD2 27  -9  9  GD2 29  27 10
GD2 30   1 17  GD2 32  12  8  GD2 35  20  8
GD2 37 -32  8  GD2 38  15  8  GD2 41   5 14
GD2 43  35 13  GD2 45  28  9  GD2 47   6 15
")

hsv <- data.frame(scan(connection, list(vac="", pat=0, wks=0, x=0)))
hsv <- transform(hsv, status = (wks > 0), wks = abs(wks))

fit1 <- survreg(Surv(wks, status) ~ 1, data=hsv, dist='exponential')
temp <- predict(fit1, type='quantile', p=.5, se=TRUE)

c(median= temp$fit[1], std= temp$se[1])
  median      std 
24.32723  4.36930

--
The predict function gives the predicted median survival and its standard error for each 
observation in the data set.  Since this was a mean-only model all n of them are the same 
and I printed only the first.
For prediction, make the assumption that the std error for the future study will be the 
same as the std error from this one.  You want the future 95% CI to not include the value 
of 16, so the future median will need to be at least 16 + 1.96*4.369 (arithmetic below).
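
The arithmetic of that cutoff, as a one-line sketch:

16 + 1.96 * 4.36930    # = 24.56 weeks, the smallest future median that excludes 16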


A nonparametric version of the argument would be


fit2 <- survfit(Surv(wks, status) ~ 1, data=hsv)
print(fit2)

records   n.max n.start  events  median 0.95LCL 0.95UCL
     48      48      48      31      21      15      35

Then make the argument that in our future study, the 95% CI will stretch 6 units to the 
left of the median, just like it did here.  This argument is a bit more tenuous though. 
The exponential CI width depends on the total number of events and total follow-up time, 
and we can guess that the new study will be similar.  The Kaplan-Meier CI also depends on 
the spacing of the deaths, which is less likely to replicate.


Notes:
 1. Use summary(fit2)$table to extract the CI values.  In R the print functions don't 
allow you to grab what was printed, summary normally does.
 2. For the exponential we could work out the formula in closed form -- a good homework 
exercise for grad students perhaps but not an exciting way to spend my own afternoon (a 
sketch of that closed form follows these notes).  An advantage of the above approach is 
that we can easily use a more realistic model like the weibull.
 3. I've never liked extracting out the Surv(t,s) part of a formula as a separate 
statement on another line.  If I ever need to read this code again, or even just the 
printout from the run, keeping it all together gives much better documentation.
 4. Future calculations for survival data, of any form, are always tenuous since they 
depend critically on the total number of events that will be in the future study.  We can 
legislate the total enrollment and follow-up time for that future study, but the number of 
events is never better than a guess.  Paraphrasing a motto found on the door of a well 
respected investigator I worked with 30 years ago (because I don't remember it exactly):


  The incidence of the condition under consideration and its subsequent death rate will 
both drop by 1/2 at the commencement of a study, and will not return to their former 
values until the study finishes or the PI retires.
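
The closed form promised in note 2, as a sketch: for exponential data 
var(log(rate)) = 1/(number of events), so the delta method gives 
se(median) = median/sqrt(events).

D <- sum(hsv$status)   # 31 events
24.32723/sqrt(D)       # = 4.369, matching the std error from predict() above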



Terry T.

---

On 07/10/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Hello All,

I'm trying to figure out how to perform a survival analysis with an historical 
control. I've spent some time looking online and in my boooks but haven't found 
much showing how to do this. Was wondering if there is a R package that can do 
it, or if there are resources somewhere that show the actual steps one takes, 
or if some knowledgeable person might be willing to share some code.

Here is a statement that describes the sort of analyis I'm being asked to do.

A one-sample parametric test assuming an exponential form of survival was used 
to test the hypothesis that the treatment produces a median PFS no greater than 
the historical control PFS of 16 weeks.  A sample median PFS greater than 20.57 
weeks would fall beyond the critical value associated with the null hypothesis, 
and would be considered statistically significant at alpha = .05, 1 tailed.

My understanding is that the cutoff of 20.57 weeks was obtained using an online 
calculator that can be found at:

http://www.swogstat.org/stat/public/one_survival.htm

Thus far, I've been unable to determine what values were plugged into the 
calculator to get the cutoff.

There's another calculator for a nonparametric test that can be found at:


Re: [R] Predictions from coxph or cph objects

2014-07-07 Thread Therneau, Terry M., Ph.D.

I've been off on vacation for a few days and so am arriving late to this 
discussion.

 Try ?print.survfit, and look at the print.rmean option and the discussion thereof in the 
Details section of the page.  It will answer your question, in more detail than you 
asked.  The option applies to survival curves from coxph fits as well.


Short summary
1. For any positive random variable X, mean(X) = integral from 0 to infinity of the 
survival curve; the one-line route is E(X) = E[ integral of 1{X > t} dt ] = integral of 
P(X > t) dt = integral of S(t) dt.  For some reason I found this particular homework 
problem impossible, back when I was a new grad student, consequently I remember it well.


2. When there is censoring the survival curve never drops to zero, which makes the full 
integral not evaluable.  There are well-known responses to this problem, but the mean 
survival is used by so few that those responses are familiar only to a small group.  
Enough people got confused by the resulting truncated mean that I set the default 
option for print.rmean to "don't print it".  The downside is that the few who do want a 
mean (like you) often have trouble discovering how to obtain it (a sketch follows below).
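
For the record, a minimal sketch of asking for it, using the lung data that ships with 
the package:

fit <- survfit(Surv(time, status) ~ 1, data=lung)
print(fit, print.rmean=TRUE)   # adds the restricted mean and its se to the printout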


A reference manual (with index) for the survival package is sorely needed.  
Someday...

Terry Therneau





Dear R users,

My apologies for the simple question, as I'm starting to learn the concepts
behind the Cox PH model. I was just experimenting with the survival and rms
packages for this.

I'm simply trying to obtain the expected survival time (as opposed to the
probability of survival at a given time t). I can't seem to find an option
from the type argument in the predict methods from coxph{survival} or
cph{rms} that will give me expected survival times.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] standard error of survfit.coxph()

2014-06-30 Thread Therneau, Terry M., Ph.D.
1. The computations behind the scenes produce the variance of the cumulative hazard. 
This is true for both an ordinary Kaplan-Meier and a Cox model.  Transformations to other 
scales are done using simple Taylor series.


  H = cumulative hazard = -log(S);  S = survival
  var(H) = var(-log(S)) = the starting point
  S = exp(-H), so var(S) is approx [deriv of exp(-H)]^2 * var(H) = S^2 
var(H)
  var(log(-log(S))) = var(log(H)), which is approx (1/H^2) var(H)

2. At the time it was written, summary.survfit was used only for printing out the survival 
curve at selected times, and the audience for the printout wanted std(S).   True, that was 
20 years ago, but I don't recall anyone ever asking for summary to do anything else.  Your 
request is not a bad idea.
  Note however that the primary impact of using the log(S), S, or log(-log(S)) scale is on 
the confidence intervals, and they do appear per request in the summary output.
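
Answering the question at the bottom directly, a sketch that leans on the Taylor series 
above and on the fact that the std.err component of a survfit.coxph object is on the 
cumulative-hazard scale (the ratio is undefined at times where S is still 1):

sf <- survfit(fit, newdata=data.frame(time=time, status=status, x=x)[1:5,])
H  <- -log(sf$surv)            # the cumulative hazard
se.cloglog <- sf$std.err / H   # se(log(-log(S))) = se(H)/H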


Terry T.


On 06/28/2014 05:00 AM, r-help-requ...@r-project.org wrote:

Message: 9
Date: Fri, 27 Jun 2014 12:39:29 -0700
From: array chiparrayprof...@yahoo.com
To:r-help@r-project.org  r-help@r-project.org
Subject: [R] standard error of survfit.coxph()
Message-ID:
1403897969.91269.yahoomail...@web122906.mail.ne1.yahoo.com
Content-Type: text/plain

Hi, can anyone help me to understand the standard errors printed in the output 
of survfit.coxph()?

time <- sample(1:15, 100, replace=TRUE)

status <- as.numeric(runif(100,0,1) < 0.2)
x <- rnorm(100,10,2)

fit <- coxph(Surv(time,status) ~ x)
### method 1

survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], 
conf.type='log')$std.err

             [,1]        [,2]        [,3]        [,4]       [,5]
 [1,] 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000
 [2,] 0.008627644 0.008567253 0.008773699 0.009354788 0.01481819
 [3,] 0.008627644 0.008567253 0.008773699 0.009354788 0.01481819
 [4,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371
 [5,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371
 [6,] 0.013800603 0.013767977 0.013889971 0.014379928 0.02353371
 [7,] 0.030226811 0.030423883 0.029806263 0.028918817 0.05191161
 [8,] 0.030226811 0.030423883 0.029806263 0.028918817 0.05191161
 [9,] 0.036852571 0.037159980 0.036186931 0.034645002 0.06485394
[10,] 0.044181716 0.044621159 0.043221145 0.040872939 0.07931028
[11,] 0.044181716 0.044621159 0.043221145 0.040872939 0.07931028
[12,] 0.055452631 0.056018832 0.054236881 0.051586391 0.10800413
[13,] 0.070665160 0.071363749 0.069208056 0.066655730 0.14976433
[14,] 0.124140400 0.125564637 0.121281571 0.118002021 0.30971860
[15,] 0.173132357 0.175309455 0.168821266 0.164860523 0.46393111

survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], 
conf.type='log')$time
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

### method 2

summary(survfit(fit, newdata=data.frame(time=time,status=status,x=x)[1:5,], 
conf.type='log'),time=10)$std.err

              1          2          3          4          5
[1,] 0.04061384 0.04106186 0.03963184 0.03715246 0.06867532

By reading the help of ?survfit.object and ?summary.survfit, the standard error provided 
in the output of method 1 (survfit()) is for the cumulative hazard, -log(survival), while 
the standard error provided in the output of method 2 (summary.survfit()) is for the 
survival itself, regardless of how you choose the value for conf.type ('log', 
'log-log' or 'plain').  This explains why the standard error output is different between 
method 1 (10th row) and method 2.

My question is how do I get standard error estimates for log(-log(survival))?

Thanks!

John


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

