Re: [R] Survey package
On Thu, 6 Sep 2007, eugen pircalabelu wrote:

> Good afternoon! I'm trying to use the survey package for a stratified
> sample which has 4 criteria on which the stratification is based. I would
> like to get the corrected weights, but for every element I get a weight
> of 1. E.g., typing
>
>   design <- svydesign(id=~1, strata=~regiune + size_loc + age_rec_hhh + size_hh, data=tabel)
>
> and then weights(design) gives me: 1,1,1,1,1,1,1,1,1,1,1,... for each element.

There are two problems. The first is that you have the wrong syntax for strata: if you have one stage of sampling with multiple stratifying factors you need to create a single factor representing the strata. One way is with interaction():

  design <- svydesign(id=~1, strata=~interaction(regiune, size_loc, age_rec_hhh, size_hh), data=tabel)

Second, you have not specified either weights or population sizes, so R has no way to work out the sampling weights. That's why you get weights of 1. You should also get a warning.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
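The interaction() step in the reply can be tried on its own with a small made-up data frame (the column names `regiune` and `size_loc` come from the post; the values are invented):

```r
# Combining several stratifying factors into the single factor that
# svydesign()'s strata argument needs (toy data, not the poster's):
tabel <- data.frame(
  regiune  = c("N", "N", "S", "S"),
  size_loc = c("big", "small", "big", "small")
)
strat <- interaction(tabel$regiune, tabel$size_loc, drop = TRUE)
nlevels(strat)  # one level per observed stratum combination
```

With drop = TRUE, combinations that do not occur in the data are not kept as empty levels.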
Re: [R] the survey package
On Thu, 6 Sep 2007, Tobias Verbeke wrote:

> eugen pircalabelu wrote:
>
>> I'm trying to use the survey package to get a better point of view for
>> my data, but I need some piece of advice: I have some data from a survey
>> which has been stratified using 2 criteria: region (7 values), size of
>> locality (5 values). Using the survey package how can I define this
>> design in a correct way (taking into account all 4 strata, not just one
>> as in the Survey example)?
>
> [snip]
>
> According to ?svydesign, strata is a formula. The following should work
> (untested):
>
>   design <- svydesign(ids=~0, strata=~regiune + size_loc, data=tabel)

This would be a two-stage sample; you actually need ~interaction(regiune, size_loc). [This reply is just to make sure it ends up linked in the archives.]

-thomas
Re: [R] Are the error messages of constrOptim() consistent with each other?
[The message quotes a very long R expression defining `apple2`; it runs to several thousand characters and arrived mangled in the archive, so it is omitted here. The readable remainder of the poster's code follows.]

apple.ana <- function(rec1, lambda1, lbar1) {
  rec <- rec1
  lambda <- lambda1
  lbar <- lbar1
  apple <- eval(apple2)
  gradient <- cbind(eval(D(apple2, "rec")), eval(D(apple2, "lambda")),
                    eval(D(apple2, "lbar")))
  attr(apple.ana, "gradient") <- gradient
  apple
}

fit.error <- function(rec1, lambda1, lbar1) {
  rec <- rec1
  lambda <- lambda1
  lbar <- lbar1
  sum((eval(apple2) * 1000 - orange)^2 / (orange^2))
}

fit.error.grr <- function(rec1, lambda1, lbar1) {
  rec <- rec1
  lambda <- lambda1
  lbar <- lbar1
  drec    <- sum(2 * eval(D(apple2, "rec"))    * (eval(apple2) * 1 - orange) / (orange^2))
  dlambda <- sum(2 * eval(D(apple2, "lambda")) * (eval(apple2) * 1 - orange) / (orange^2))
  dlbar   <- sum(2 * eval(D(apple2, "lbar"))   * (eval(apple2) * 1 - orange) / (orange^2))
  c(drec, dlambda, dlbar)
}

ui <- matrix(c(1,-1,0,0,0,0, 0,0,1,-1,0,0, 0,0,0,0,1,-1), 6, 3)
ci <- c(0, -0.5, 0, -2, 0, -0.6)

constrOptim(c(0.5, 0.3, 0.5), f = fit.error, gr = fit.error.grr, ui = ui, ci = ci)
constrOptim(c(0.5, 0.9, 0.5), f = fit.error, gr = fit.error.grr, ui = ui, ci = ci)
constrOptim(c(0.3, 0.5, 0.5), f = fit.error, gr = fit.error.grr, ui = ui, ci = ci)

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
Re: [R] Survey package
On Sun, 9 Sep 2007, eugen pircalabelu wrote:

> A short example:
>
>   stratum id weight nh Nh  y sex
>         1  1      3  5 15 23   1
>         1  2      3  5 15 25   1
>         1  3      3  5 15 27   2
>         1  4      3  5 15 21   2
>         1  5      3  5 15 22   1
>         2  6      4  3 12 33   1
>         2  7      4  3 12 27   1
>         2  8      4  3 12 29   2
>
> where nh is the size of the sample stratum and Nh the corresponding
> population value, and y is a metric variable. Now if I let
>
>   design <- svydesign(id=~1, data=age, strata=~stratum, fpc=~Nh)
>
> then weights(design) gives me 3,3,3,3,3,4,4,4. If I then let
>
>   x <- postStratify(design, strata=~sex, data.frame(sex=c(1,2), freq=c(10,15)))
>
> the weights for elements 1-8 become 2.17, 2.17, 5.35, 5.35, 2.17, 1.73,
> 1.73, 4.28. If I define
>
>   design <- svydesign(id=~1, data=age)
>   x <- postStratify(design, strata=~sex, data.frame(sex=c(1,2), freq=c(10,15)))
>
> the weights become 2, 2, 5, 5, 2, 2, 2, 5.
>
> The question: does postStratify recognize that I have already stratified
> in the first design by stratum and then post-stratifies by sex? And why
> is that? (Because I don't have the full joint distribution, the
> sex*stratum crossing, in order to apply the postStratify function
> correctly.) I see that Mr Lumley uses the postStratify function when the
> design does not include strata (eg from ?postStratify: [...]

This gives you a design stratified by stratum and post-stratified by sex, which is not the same as stratifying by stratum*sex or post-stratifying by stratum*sex. In this case you should probably rake() on stratum and sex rather than just post-stratifying. Post-stratifying on sex is equivalent to one iteration of the iterative proportional fitting algorithm used in raking.

-thomas
Re: [R] mode or parameters of readBin
On Mon, 10 Sep 2007, Sigbert Klinke wrote:

> Hi,
>
>   sapply(formals(readBin), mode)
>         con      what         n      size    signed    endian
>      "name"    "name" "numeric" "logical" "logical"    "call"
>
> returns "logical" for the mode of size. But in the documentation it is
> said that size should be an integer. Does anyone know why the mode is
> logical?

Because the default value, NA, is a logical constant.

-thomas
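The underlying fact, that a bare NA is a logical constant while the typed variants are not, can be checked directly in any R session:

```r
# NA with no type suffix is the logical missing value:
mode(NA)           # "logical"
class(NA)          # "logical"
mode(NA_integer_)  # "numeric" -- the typed missing values differ
```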
Re: [R] Excel (off-topic, sort of)
On Wed, 29 Aug 2007, Alberto Monteiro wrote:

> Do you know what's in my wish list? I wish spreadsheets and computer
> languages had gone one step further. I mean, it's nice to define Cell X
> to be equal to Cell Y + 10, and then when we change Cell Y, magically we
> see Cell X change. But why can't it be the reverse? Why can't I change
> Cell X _and see the change in Cell Y_?

Well, most statistical calculations are data-reducing, ie, many-to-one. I don't think you are likely to find a language where you can change the confidence limits for the sample mean and have the data change in response.

-thomas
Re: [R] subset using noncontiguous variables by name (not index)
On Mon, 27 Aug 2007, Muenchen, Robert A (Bob) wrote:

> Gabor,
>
> That works great! I think this would be a very helpful addition to the
> main R distribution. Perhaps with a single colon representing numerical
> order (exactly as you have written it) and two colons representing the
> order of the variables as they appear in the data frame (your first
> example). That's analogous to SAS' x1-xN, which you know gets those N
> variables, and a--z, which selects an unknown number of variables a
> through z; how many that is depends upon their order in the data frame.
> That would not only be very useful in general, but it would also make
> transitioning to R from SAS or SPSS less confusing. Is R still being
> extended in such basic ways, or does that muck up existing programs too
> much?

In principle base R can be extended like that, but a strong case is needed for non-standard evaluation rules and for depleting the restricted supply of short binary operator names.

The reason for subset() and its behaviour is that 'variables as they appear in the data frame' is typically ambiguous -- which data frame? In SPSS you have only one, and in SAS there is a default one, so there is no ambiguity in X1--Y2; but in R it needs another argument specifying the data frame, so it can't really be a binary operator.

The double colon :: and triple colon ::: are already used for namespaces, and a search of r-help reveals two previous, different, suggestions for %:%.

-thomas
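For comparison, the existing subset() mechanism already resolves name ranges once the data frame is supplied explicitly; `:` in the select argument works by column position in that data frame (toy data invented here):

```r
d <- data.frame(x1 = 1, x2 = 2, x9 = 3, z = 4)
# ':' inside select uses positions in this data frame, so x1:x9
# selects x1, x2, x9 regardless of the numeric suffixes:
names(subset(d, select = x1:x9))
names(subset(d, select = -z))   # negation drops a column
```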
Re: [R] Max vs summary inconsistency
On Mon, 27 Aug 2007, Adam D. I. Kramer wrote:

> Hello,
>
> I'm having the following questionable behavior:
>
>   summary(m)
>      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>         1   13000   26280   25890   38550   50910
>   max(m)
>   [1] 50912
>   typeof(m)
>   [1] "integer"
>   class(m)
>   [1] "integer"
>
> ...it seems to me like max() and summary(m)[6] ought to return the same
> number. Am I doing something wrong?

They do return the same number, they just print it differently: summary() prints four significant digits by default.

-thomas
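The apparent mismatch is display rounding. signif() with 4 digits reproduces what summary() showed for the poster's maximum (the value 50912 is taken from the post):

```r
max_m <- 50912L
signif(max_m, 4)  # 50910 -- the value summary() displayed as Max.
```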
Re: [R] oddity with method definition
On Mon, 27 Aug 2007, Faheem Mitha wrote:

> Just wondered about this curious behaviour. I'm trying to learn about
> classes. Basically setMethod works the first time, but does not seem to
> work the second time. Faheem.
>
>   setClass("foo", representation(x = "numeric"))
>   bar <- function(object) { return(0) }
>   bar.foo <- function(object) { print(object@x) }
>   setMethod("bar", "foo", bar.foo)
>   bar(f)
>   # bar(f) gives 1.

Not for me. It gives

  bar(f)
  Error: object 'f' not found
  Error in bar(f) : error in evaluating the argument 'object' in selecting a method for function 'bar'

However, if I do f <- new("foo", x = 1) first, it gives 1.

>   bar <- function(object) { return(0) }

Here you have masked the generic bar() with a new function bar(). Redefining bar() is the problem, not the second setMethod().

>   bar.foo <- function(object) { print(object@x) }
>   setMethod("bar", "foo", bar.foo)

Because there was a generic bar(), even though it is overwritten by the new bar(), setMethod() doesn't automatically create another generic.

>   f <- new("foo", x = 1)
>   bar(f)
>   # bar(f) gives 0, not 1.

Because bar() isn't a generic function:

  bar
  function(object) { return(0) }

If you had used setGeneric() before setMethod(), as recommended, your example would have done what you expected, but it would still have wiped out any previous methods for bar() -- eg, try setMethod("bar", "baz", function(object) print("baz")) before you redefine bar(), and notice that getMethod("bar", "baz") no longer finds it.

-thomas
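The recommended pattern from the reply, creating the generic explicitly before adding methods, looks like this (same toy class as the post):

```r
library(methods)

setClass("foo", representation(x = "numeric"))
# Declare the generic first, then register the method for class "foo":
setGeneric("bar", function(object) standardGeneric("bar"))
setMethod("bar", "foo", function(object) object@x)

f <- new("foo", x = 1)
bar(f)  # dispatches to the "foo" method and returns 1
```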
Re: [R] Saving results from Linux command line
> There could still be functions that divert a copy of all the output to a
> file, for example.

And indeed there are:

  sink("transcript.txt", split=TRUE)

-thomas

On Fri, 24 Aug 2007, Muenchen, Robert A (Bob) wrote:

> I looked long and hard for that information. Thank you VERY much!
> -Bob
>
> -----Original Message-----
> From: Richard M. Heiberger [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 24, 2007 1:52 PM
> To: Muenchen, Robert A (Bob); r-help@stat.math.ethz.ch
> Subject: Re: [R] Saving results from Linux command line
>
> There can't be functions in the R language to save the transcript of a
> session. In this respect R is a filter: it takes an input stream of text
> and returns an output stream of text. R doesn't remember the streams.
> The Windows RGui remembers them. The ESS *R* buffer remembers them. Any
> terminal emulator could in principle remember them. R itself can't.
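A minimal sketch of the sink() approach (the file name is invented; split = TRUE keeps the output on screen as well as in the file -- note that sink() captures output, not the commands themselves):

```r
tf <- tempfile(fileext = ".txt")
sink(tf, split = TRUE)   # divert a copy of all output to tf
print(summary(1:10))
sink()                   # stop diverting
out <- readLines(tf)     # the saved copy of the output
length(out)
```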
Re: [R] R old versions archive anywhere?
On Thu, 23 Aug 2007, [EMAIL PROTECTED] wrote:

> Hi Folks,
>
> Does anyone know if (apparently not on CRAN) there is any archive of
> older versions of R packages?

Yes. On CRAN. At the bottom of the page listing all the packages there is a section

  Related Directories
    Archive: Previous versions of the packages listed above.

linking to [CRAN]/src/contrib/Archive

-thomas
Re: [R] All subsets regression
On Wed, 22 Aug 2007, Alan Harrison wrote:

> Hey folks,
>
> I'm trying to do all subsets on a zero-inflated Poisson regression. I'm
> aware of the leaps and regsubsets functions but I don't know if they
> work for ZIP regressions or how the syntax fits in for them.

They don't.

-thomas
Re: [R] Regulatory Compliance and Validation Issues
On Fri, 17 Aug 2007, Cody Hamilton wrote:

[snip]

> I have a few specific comments/questions that I would like to present to
> the R help list.

[snip]

> 2. While the document's scope is limited to base R plus recommended
> packages, I believe most companies will need access to functionalities
> provided by packages not included in the base or recommended packages.
> (For example, I don't think I could survive without the sas.get()
> function from the Design library.) How can a company address the issues
> covered in the document for packages outside its scope? For example,
> what if a package's author does not maintain historical archive versions
> of the package? What if the author no longer maintains the package? Is
> the solution to add more packages to the recommended list (I'm fairly
> certain that this would not be a simple process) or is there another
> solution?

This will have to be taken up with the package maintainer. The R Foundation doesn't have any definitive knowledge about, eg, Frank Harrell's development practices, and I don't think the FDA would regard our opinions as relevant. Archiving, at least, is addressed by CRAN: all the previously released versions of packages are available.

> 3. At least at my company, each new version must undergo basically the
> same IQ/OQ/PQ as the first installation. As new versions of R seem to
> come at least once a year, the ongoing validation effort would be
> painful if the most up-to-date version of R is to be maintained within
> the company. Is there any danger in delaying the updates (say updating R
> within the company every two years or so)?

It's worse than that: there are typically 4 releases of R per year (the document you are commenting on actually gives dates). The ongoing validation effort may indeed be painful, and this was mentioned as an issue in the talk by David James & Tony Rossini. The question of what is missed by delaying updates can be answered by looking at the NEWS file. The question of whether it is dangerous is really an internal risk management issue for you.

-thomas
Re: [R] Question about lme and AR(1)
On Sat, 18 Aug 2007, Francisco Redelico wrote:

> Dear R users,
>
> As far as I know, the EM algorithm can only be applied to estimate
> parameters from a regular exponential family.

No. The EM algorithm will converge to a stationary point (and, except in artificial cases, to a local maximum) for any likelihood. The special case you may be thinking of is that in some problems the E-step is equivalent to computing E[missing data | observed data] rather than the more general E[loglikelihood | observed data].

-thomas
Re: [R] using sampling weights in glm
On Wed, 15 Aug 2007, Werner Wernersen wrote:

> Hi,
>
> I have a supposedly representative sample from a stratified survey.
> Hence, all observations have an associated sampling weight which
> inflates the sample to population size. I want to run a glm probit
> regression on the data, but I am not clear about whether, or better how,
> I can use the weights in it. From the description of the weights
> argument of glm it seems to me that I cannot plug these weights in
> there.

You want svyglm() in the survey package.

-thomas
Re: [R] Regsubsets statistics
On Wed, 8 Aug 2007, Markus Brugger wrote:

> Dear R-help,
>
> I have used the regsubsets function from the leaps package to do subset
> selection of a logistic regression model with 6 independent variables
> and all possible two-way interactions. As I want to get information
> about the statistics behind the selection output, I've intensively
> searched the mailing list to find answers to the following questions:
>
> 1. What should I do to get the statistics behind the selection (e.g.
> BIC)? summary.regsubsets(object) just returns "*" meaning in or " "
> meaning out. As the plot function generates BICs, these values must
> obviously be computed and available somewhere, but where? Is it possible
> to directly get AIC values instead of BIC?

These statistics are in the object returned by summary(). Using the first example from the help page:

  names(summary(a))
  [1] "which"  "rsq"  "rss"  "adjr2"  "cp"  "bic"  "outmat"  "obj"
  summary(a)$bic
  [1] -19.60287 -28.61139 -35.65643 -37.23388 -34.55301

> 2. As to the plot function, I've encountered a problem with setting the
> ylim argument. I fear that this (nice!) particular plot function ignores
> many of these additional arguments. How can I nevertheless change this
> setting?

You can't (without modifying the plot function). The ... argument is required for inheritance [ie, required for R CMD check] but it doesn't take graphical parameters.

> 3. As it is not explicitly mentioned in the manual, can I really use
> regsubsets for logistic regression?

No. If your data set is large enough relative to the number of variables, you can fit a model with all variables and then apply regsubsets() to the weighted linear model arising from the IWLS algorithm. This will give an approximate ranking of models that you can then refit exactly. This is useful if you want to summarize the best few thousand models on 30 variables, but not if you want a single model. On the other hand, regsubsets() isn't useful if you want a single model anyway.

-thomas
Re: [R] Mixture of Normals with Large Data
On Wed, 8 Aug 2007, Martin Maechler wrote:

> >>>>> "BertG" == Bert Gunter [EMAIL PROTECTED]
> >>>>>     on Tue, 7 Aug 2007 16:18:18 -0700 writes:
>
>   TV> Have you considered the situation of wanting to characterize
>   TV> probability densities of prevalence estimates based on a complex
>   TV> random sample of some large population.
>
>   BertG> No -- and I stand by my statement. The empirical distribution
>   BertG> of the data themselves are the best characterization of the
>   BertG> density. You and others are free to disagree.
>
> I do agree with you, Bert. From a practical point of view, however,
> you'd still want to use an approximation to the data ECDF, since the
> full ecdf is just too large an object to handle conveniently. One
> simple, quite small and probably sufficient such approximation may be
> using the equivalent of quantile(x, probs = (0:1000)/1000), which is
> pretty much related to just working with a binned version of the
> original data; something others have proposed as well.

I have done Normal (actually logNormal) mixture fitting to pretty large data (particle counts by size) for summary purposes. In that case it would not have done just as well to use quantiles, as I had many sets of data (every three hours for several months) and the locations of the mixture components drift around over time. The location, scale, and mass of the four mixture components really were the best summaries. This was the application that constrOptim() was written for.

-thomas
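Maechler's quantile-based ECDF approximation can be sketched directly (the data here are invented; any large sample works the same way):

```r
# Summarizing a large sample by 1001 quantiles instead of the full ECDF:
set.seed(42)
x <- rlnorm(1e5)                          # a large (invented) sample
approx_ecdf <- quantile(x, probs = (0:1000)/1000)
length(approx_ecdf)                       # 1001 values stand in for 1e5 points
```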
Re: [R] small sample techniques
On Wed, 8 Aug 2007, Nair, Murlidharan T wrote:

> Indeed, I understand what you say. The degrees of freedom for the dummy
> example are n1+n2-2 = 8. But when I run the t.test I get 5.08; am I
> missing something?

Yes. You are probably looking for the version of the t-test that assumes equal variances (the original one), so you need var.equal=TRUE.

-thomas

> -----Original Message-----
> From: Moshe Olshansky [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, August 07, 2007 9:05 PM
> To: Nair, Murlidharan T; r-help@stat.math.ethz.ch
> Subject: Re: [R] small sample techniques
>
> Hi Nair,
>
> If the two populations are normal the t-test gives you the exact result
> for whatever the sample size is (the sample size will affect the number
> of degrees of freedom). When the populations are not normal and the
> sample size is large it is still OK to use the t-test (because of the
> Central Limit Theorem), but this is not necessarily true for a small
> sample size. You could use simulation to find the relevant
> probabilities.
>
> --- Nair, Murlidharan T [EMAIL PROTECTED] wrote:
>
>> If my sample size is small, is there a particular switch option that I
>> need to use with t.test so that it calculates the t ratio correctly?
>> Here is a dummy example (alpha = 0.05; mean pain reduction for A = 27,
>> B = 31; SDs are SDA = 9, SDB = 12):
>>
>>   drgA.p <- rnorm(5, 27, 9); drgB.p <- rnorm(5, 31, 12)
>>   t.test(drgA.p, drgB.p)  # what do I need to give as additional parameter here?
>>
>> I can do it manually but was looking for a switch option that I need to
>> specify for t.test.
>> Thanks ../Murli
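The df issue can be checked directly: with var.equal = TRUE the degrees of freedom are exactly n1 + n2 - 2, while the default Welch test gives the fractional df the poster saw (data simulated as in the dummy example):

```r
set.seed(1)
drgA.p <- rnorm(5, 27, 9)
drgB.p <- rnorm(5, 31, 12)
# Welch test (the default): fractional df, like the 5.08 in the post
t.test(drgA.p, drgB.p)$parameter
# Pooled-variance test: df = 5 + 5 - 2 = 8
t.test(drgA.p, drgB.p, var.equal = TRUE)$parameter
```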
Re: [R] Robust Standard Errors for lme object
On Tue, 7 Aug 2007, Doran, Harold wrote:

> Lucy:
>
> Why are you interested in robust standard errors from lme? Typically,
> robust standard errors are sought when there is model misspecification
> due to ignoring some covariance among units within a group. But a mixed
> model is designed to directly account for covariances among units
> within a group, such that the standard errors more adequately represent
> the true sampling variance of the parameters.

I think it's a perfectly reasonable thing to want, but it is not easy to provide because of the generality of lme(). For example, models with crossed effects need special handling, and possibly so do time-series or spatial covariance models with large numbers of observations per group. I imagine that misspecification of the variance, rather than the correlation, would be the main concern, just as it is with independent observations. Of course the model-robust variances would only be useful if the sample size of top-level units was large enough, and if the variance components were not of any direct interest.

> So, the lme standard errors are robust in the sense that they are
> presumably correct if you have your model correctly specified.

To paraphrase the Hitchhiker's Guide: This must be some definition of the word 'robust' that I was not previously aware of. :)

-thomas
[R] [R-pkgs] surveyNG (and survey)
'surveyNG' version 0.3 is on CRAN. This package provides experimental features for survey analysis that may be incorporated in the survey package in the future. Currently there are facilities for analysis of complex surveys using (possibly large) data sets stored in a SQLite database. However, analysis facilities for these SQL-backed survey designs are rather more limited than in the 'survey' package. Version 0.3 adds hexagonal binning plots and kernel smoothing. Also, the 'survey' package hasn't been announced on this list since version 2.9 in 2005 and verison 3.6-11 was recently posted. It provides fairly comprehensive facilities for analysis of complex survey designs. Major additions since 2.9 are calibration estimators (aka GREG or generalized raking), simple two-phase designs, and smoothing. -thomas Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle ___ R-packages mailing list [EMAIL PROTECTED] https://stat.ethz.ch/mailman/listinfo/r-packages __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] constrOptim
On Wed, 1 Aug 2007, Joanne Lee wrote:

> The constraints on the optimization problem are:
> 1 - components of potentialargmin must add to 1.
> 2 - each potentialargmin component must be (weakly) positive and
>     (weakly) less than 1.
> 3 - potentialargmin %*% c(1,2,3,4,5,6,7,8,9) = 4.5

constrOptim() is not good at equality constraints, so constraints 1 and 3 would be better expressed by reparametrization. There is no need to constrain the parameters to be positive, as the function already goes infinite at zero. Finally, if the parameters are non-negative there is no need to constrain them to be <=1, as this is implied by the first constraint.

It might be interesting to find out if some automated Lagrange-multiplier approach could be built into constrOptim() for equality constraints, but it is not a high enough priority that I am likely to do it.

-thomas
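A sketch of the reparametrization the reply recommends: optimize over unconstrained values and map them onto the probability simplex, so the sum-to-one and positivity constraints hold automatically. The objective `f` here is an invented stand-in, not the poster's function:

```r
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }

# Invented toy objective over 9 simplex components, parametrized by
# 8 unconstrained values (the 9th is pinned at 0 for identifiability):
f <- function(z) { p <- softmax(c(z, 0)); -sum(log(p)) }

fit <- optim(rep(0, 8), f)      # plain optim(), no constraints needed
p <- softmax(c(fit$par, 0))
sum(p)                          # 1 by construction; each p[i] in (0,1)
```

Constraint 3, the fixed weighted mean, could be handled similarly by solving for one component in terms of the others, or by a penalty term.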
Re: [R] Nonlinear optimization with constraints
On Wed, 1 Aug 2007, Patrick Burns wrote: If you put in penalties for breaking constraints, it is generally better to make the penalty depend on the size of the violation. This keeps the function continuous (though usually not differentiable at the boundary), and gives the optimizer a hint about which way to go. And this is what constrOptim does (though in a slightly more sophisticated way that keeps the function twice-differentiable at the boundary). I don't know why constrOptim() isn't applicable here -- the constraint looks like a linear inequality to me. -thomas
Re: [R] - round() strange behaviour
On Thu, 2 Aug 2007, Monica Pisica wrote: Hi, I am getting some strange results using round - it seems that it depends if the number before the decimal point is odd or even Yes. This is explained in the help page for round(). -thomas
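The behaviour ?round documents is IEC 60559 round-half-to-even ("banker's rounding"): exact .5 cases go to the nearest even value, which is why the direction appears to alternate with the parity of the digit. A quick demonstration:

```r
# Round-half-to-even, as documented in ?round
round(0.5)  # 0
round(1.5)  # 2
round(2.5)  # 2
round(3.5)  # 4
```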
Re: [R] Q: calling par() in .First()
On Thu, 2 Aug 2007, D. R. Evans wrote: As a result of another thread here, I need to be sure that the function par(bg='white') has executed before I create a plot. The simplest thing seemed to put it in .Rprofile: .First <- function() { options(width=150); par(bg='white') } But now R complains at startup: Error in .First() : couldn't find function "par" I have confirmed that once R has started, I can type par(bg='white') and R no longer complains. ISwR has a nice little section on Using par, but there's no hint that something like this can go wrong :-( So what am I not understanding this time? par() is in the 'graphics' package, which is not loaded by the time .Rprofile runs. You want graphics::par(bg='white') Information from which this can be deduced, and examples, are in ?Startup, though it isn't explicitly stated there. -thomas
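Putting the fix together, a .Rprofile that works before the graphics package is attached would look like this (a sketch; the width and bg values are just the poster's):

```r
# In ~/.Rprofile -- this runs before the 'graphics' package is attached,
# so par() must be fully qualified with graphics::
.First <- function() {
  options(width = 150)
  graphics::par(bg = "white")
}
```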
Re: [R] reading stata files: preserving values of variables converted to factors
On Thu, 26 Jul 2007, Ben Saylor wrote: Hi, I am a Stata user new to R. I am using read.dta to read a Stata file that has variables with value labels. read.dta converts them to factors, but seems to recode them with values from 1 to the number of factor levels (looking at the output of unclass(varname)), so the original numerical values are lost. Yes. The R factor type should not be used if you want the original levels. It is not a 'labelled numeric' type and the numbers are an implementation detail. Using convert.factors=FALSE preserves the values, but seems to discard the labels. It doesn't discard the labels. They are kept in the attributes of the data frame. -thomas
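Those attributes can be inspected directly. A self-contained round trip sketch follows; the attribute name "label.table" is what foreign::read.dta has used, but treat it as an assumption to check against your version:

```r
library(foreign)
tf <- tempfile(fileext = ".dta")
write.dta(data.frame(grp = factor(c("low", "high", "low"))), tf)

d <- read.dta(tf, convert.factors = FALSE)
d$grp                   # the original numeric codes, not a factor
attr(d, "label.table")  # the value labels, kept as a data-frame attribute
```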
Re: [R] Minitab Parametric Distribution Analysis in R
The survival package (survreg() function) will fit quite a few parametric models under censoring. If you aren't doing regression, but just one-sample fitting, you can feed the appropriate censored or truncated likelihood to mle() in the stats4 package. Both packages should be part of your R distribution. -thomas On Wed, 25 Jul 2007, Tom La Bone wrote: Minitab can perform a Parametric Distribution Analysis - Arbitrary Censoring with one of eight distributions (e.g., weibull), giving the maximum likelihood estimates of the parameters in the distribution for a given dataset. Does R have a package that provides equivalent functionality? Thanks for any advice you can offer. Tom La Bone [[alternative HTML version deleted]]
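A one-sample Weibull fit under censoring with survreg() might look like this (simulated data; note that survreg reports the log of the Weibull scale parameter as the intercept, and its `scale` component is 1/shape):

```r
library(survival)
set.seed(1)
# Simulate Weibull event times, right-censored at 150 (a sketch)
time   <- rweibull(200, shape = 1.4, scale = 108)
status <- as.integer(time < 150)      # anything past 150 is censored
time   <- pmin(time, 150)
fit <- survreg(Surv(time, status) ~ 1, dist = "weibull")
c(scale_est = unname(exp(coef(fit))), shape_est = 1/fit$scale)
```

Minitab's "arbitrary censoring" (interval censoring) can be handled the same way via Surv(left, right, type = "interval2").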
Re: [R] (no subject)
It's a FAQ (Question 7.32). -thomas On Mon, 23 Jul 2007, [EMAIL PROTECTED] wrote: Dear Sir/Madam, I am running an R program for my SNP data. There are some errors when I run the glm model in the Hapassoc software: sometimes it is over the memory and sometimes the matrix is singular. I want to ignore these errors and execute the next statement. Otherwise, I have a big big trouble. Do you have some idea about this problem of ignoring errors? Wish to get your help asap. thanks. -- Wei Guo Department of Statistics The Ohio State University Columbus, Ohio 43210 Tel: 001-614-292-4713(o) e-mail: [EMAIL PROTECTED]
Re: [R] GEE code
On Fri, 20 Jul 2007, Dimitris Rizopoulos wrote: from your description I think you need a random-effects model. Since he specifically said fixed effects, I would have assumed from the description that he wanted a fixed-effects model. Since you assume normality the parameter estimates from a random-effects model (in your case `GDP.log2') have a marginal interpretation (in fact they have both conditional and marginal interpretations). But they still aren't the fixed-effects estimates, and they behave quite differently under model misspecification (and under confounding). A fixed-effects linear model with GEE just needs the formula ineq ~ GDP.log2 + country to specify an indicator variable for each country. If he had a logistic regression model things would be more complicated, but for a linear or log-linear model it is just a matter of adding predictors. Now, I might well use a linear mixed model in this context, but he did fairly clearly indicate that wasn't what he was looking for. -thomas
Re: [R] R
On Thu, 19 Jul 2007, Fluss wrote: Hello! I am using the svyglm procedure for logistic regression in survey data. I wondered how the strata affect the estimates' SEs (in addition to the weights given proportional to population size). I know that for simple regression the measurements in each stratum are assumed to have a different variance. But in a logistic model this is not the case. It is simpler (and more complicated) than that. The survey package uses the same formula for nearly all designs and estimators, so it doesn't have to handle special cases like estimating separate stratum variances for stratified models. For a population total the variance estimator is just the Horvitz-Thompson estimator. Other estimators are defined by the estimating equations they solve, so the mean solves sum_i w_i (x_i - mu) = 0 and logistic regression solves sum_i w_i x_i (y_i - mu_i) = 0. We now compute the Horvitz-Thompson variance estimator for the sum of the estimating functions (V) and also the population total of the derivative of the estimating functions (D). The variance of the estimator is D^{-1} V D^{-1}. The standard reference for this in the survey literature seems to be Binder, David A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292, which is in the References section of help(svyrecvar). -thomas
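The sandwich D^{-1} V D^{-1} can be sketched for a weighted logistic regression. Here V is filled in with a crude with-replacement approximation rather than the full Horvitz-Thompson form, so this is illustrative only, not what svyrecvar actually computes:

```r
set.seed(2)
n <- 200
x <- cbind(1, rnorm(n))                    # design matrix with intercept
w <- runif(n, 1, 3)                        # sampling weights
y <- rbinom(n, 1, plogis(x %*% c(-0.5, 1)))
fit <- glm(y ~ x - 1, family = quasibinomial, weights = w)
mu <- fitted(fit)
U <- x * (w * (y - mu))                    # estimating functions w_i x_i (y_i - mu_i)
D <- crossprod(x, w * mu * (1 - mu) * x)   # total of the derivatives
V <- crossprod(U)                          # crude stand-in for the HT variance of sum(U)
sandwich <- solve(D) %*% V %*% solve(D)    # variance estimate for coef(fit)
```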
Re: [R] Complex surveys, properly computed SEs and non-parametric analyses
On Sun, 15 Jul 2007, Tobias Verbeke wrote: The survey package of Thomas Lumley has very broad functionality for the analysis of data from complex sampling designs. Please find below the homepage of the package (which is available on CRAN): http://faculty.washington.edu/tlumley/survey/ I don't think non-parametric one-way ANOVA is implemented No. but quoting http://faculty.washington.edu/tlumley/survey/survey-wss.pdf Many features of the survey package result from requests from unsatisfied users. For new methods the most important information is a reference that gives sufficient detail for implementation. A data set is nice but not critical. Yes, and the details are especially non-obvious here. The test won't be small-sample exact, AFAICS, and it isn't clear whether the idea is to add weights to the influence function for the signed-rank test or to replace it with a design-based estimate of a population quantity. Often these approaches are equivalent, but they won't be in this case. It wouldn't have occurred to me that people would want this. 'Non-parametric' isn't really a relevant idea since design-based inference assumes a completely known model for the sampling indicators and doesn't even treat the data as random variables. All this goes to say that if there is a standard quantity that John wants, it will have resulted in part from a set of arbitrary decisions, and it won't be possible to reverse-engineer the estimator in the absence of a precise description. This is in contrast to apparently more complicated analyses such as calibration estimators for Cox models in case-cohort designs, which follow just by putting standard pieces together in an obvious way. -thomas
Re: [R] Help with write.foreign (exporting data to Stata)
On Tue, 10 Jul 2007, Stefan Grosse wrote: I am not sure what you are doing there but what you need is library(foreign) and write.dta() write.foreign should also work, though. My guess is that Kate used tempfile() to specify the filenames, and that the data file would then have been deleted on leaving R. This is only a guess, of course. The syntax for write.dta is write.dta(the.data.set, file="dataset.dta") and for write.foreign is write.foreign(the.data.set, codefile="dataset.do", datafile="dataset.raw", package="Stata") -thomas see ?write.dta once you have loaded the foreign package Stefan Original Message Subject: [R] Help with write.foreign (exporting data to Stata) From: kdestler [EMAIL PROTECTED] To: r-help@stat.math.ethz.ch Date: Tue Jul 10 2007 19:37:54 GMT+0200 Hi. I'm trying to export a dataframe from R into Stata to use a statistical function I have there. I attached library write.foreign and renamed my variables to get them to match Stata's required format, and now have the following error: file '/tmp/Rtmps7rmrM/file1c06dac8.raw' not found Other than typing write.foreign, do I need to do something in R to get it to save the file on my hard drive? When I search for the file name on my computer nothing comes up. I'm using a Mac in case that makes a difference. Thanks, Kate
Re: [R] character string to name
On Mon, 9 Jul 2007, Jim Lemon wrote: Hi folks, I thought I recalled a request for turning a character string into an object name as in: Yes. It's a FAQ. -thomas x$as.name(y) <- 1:4 OR x <- data.frame(as.name(y) = 1:4) However, as.name and a few other uninformed attempts didn't even come close. A search of "character to name" produced no helpful functions. This isn't a very urgent request, but if anyone knows some trick to perform this transformation, I would like to hear about it. Thanks. Jim
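The FAQ's answer is that no string-to-name conversion is needed: character indexing does the job directly. A minimal sketch (the variable names here are invented):

```r
y <- "mycol"
x <- list()
x[[y]] <- 1:4                        # same effect as x$mycol <- 1:4
x2 <- setNames(data.frame(1:4), y)   # a data frame with the column named "mycol"
```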
Re: [R] Is it a bug ?
On Thu, 5 Jul 2007, Giuseppe PEDRAZZI wrote: I am using R 2.5.0, windows XP - italian language. I was performing some calculations on fractional exponentials and I found a strange behaviour. I do not know if it is really a bug, but I would expect a different answer from R. I was trying the following : x <- seq(-3, 3, by = 0.1); n <- 2.2; y <- exp(-x^n) well, the y vector contains NaN for all negative values of x Yes. Non-integer powers of negative numbers are undefined (unless you use complex numbers). but if you ask for single value calculation like y <- exp(-(-3)^2.2) or y <- exp(-(-2.9)^2.2) the answer is correct. I get NaN for both of these. Perhaps you mean exp(-2.9^2.2)? This gives a valid answer, but that is because it is exp(-(2.9^2.2)), not exp((-2.9)^2.2) It seems it does not make the calculation in vector form. I got the same behaviour (NaN) in a for loop: for(i in 1:length(x)) y[i] <- exp(x[i]^n) giving NaN for the 30 negative values of x, then 1.00 at x = 0 and finite values rising from 1.006330 to about 73982 for positive x. Is it strange or did I miss something ? You missed something. It is not clear what you missed because some of your examples do not give the answer you say they give. -thomas
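The precedence point is the whole story: unary minus binds more loosely than ^, so a leading minus is applied after the power. A quick check:

```r
-3^2.2           # parsed as -(3^2.2): a finite negative number
(-3)^2.2         # NaN: non-integer power of a negative number
exp(-2.9^2.2)    # valid, because it is exp(-(2.9^2.2))
exp((-2.9)^2.2)  # NaN, whether vectorized or not
```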
Re: [R] Determining whether a function's return value is assigned
On Sat, 30 Jun 2007, Paul Laub wrote: Dear all, Does R offer a means by which a function can determine whether its return value is assigned? I am using R 2.4.1 for Windows. No Suppose what I am looking for is called return.value.assigned. Then one might use it like this: myfunction <- function () { # Create bigobject here if (return.value.assigned()) { bigobject } else { summary(bigobject) } } That's what the print() function is for. Write a print() method. -thomas
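The print-method approach looks like this: return the full object with a class attribute, and let auto-printing at the prompt show only a summary. The class and field names below are invented for illustration:

```r
myfunction <- function() {
  bigobject <- rnorm(1e4)                 # stand-in for the big result
  structure(list(data = bigobject), class = "bigresult")
}
print.bigresult <- function(x, ...) {
  print(summary(x$data))                  # runs only when the value auto-prints
  invisible(x)
}
r <- myfunction()   # assigned: nothing is printed
r                   # typed at the prompt: print.bigresult() shows the summary
```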
Re: [R] extract index during execution of sapply
On Fri, 22 Jun 2007, Ben Bolker wrote: Christian Bieli christian.bieli at unibas.ch writes: Hi there During execution of sapply I want to extract the number of times the function given to sapply has been executed. I came up with: mylist <- list(a=3,b=6,c=9) sapply(mylist, function(x) as.numeric(gsub("[^0-9]", "", deparse(substitute(x))))) This works fine, but looks quite ugly. I'm sure that there's a more elegant way to do this. Any suggestion? Christian I would love to have an answer to this -- when I run into this kind of problem I usually end up using mapply: e.g., suppose I have mylist <- replicate(5, list(x=runif(10), y=runif(10)), simplify=FALSE) and I want to plot each element in a different color. I'd like to be able to do plot(0:1, 0:1, type="n"); lapply(mylist, plot, col=i) but instead I do mapply(function(x,i) points(x, col=i), mylist, 1:5) would it be too ugly to have a special variable called INDEX that could be used within an sapply/lapply statement? There are two distinct suggestions here: a variable that says *how many* times the function has been called, and a variable that says *which element* is currently being operated on. The first seems undesirable as order of evaluation really should not matter in the apply functions. The second makes more sense but is still a little tricky. AFAICS there is no way for lapply() to find out whether FUN will accept an argument INDEX without an "unused argument(s)" error, so it can't just be passed as an argument. This suggests having yet another apply function, that would assume an INDEX argument and might be written yapply <- function(X, FUN, ...) { index <- seq(length.out = length(X)); mapply(FUN, X, INDEX = index, MoreArgs = list(...)) } However, I think it would be preferable in many cases for INDEX to be names(X) if it exists, rather than 1:n. In any case, it is easy to write the function.
-thomas
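A usage sketch of the yapply() idea from the reply above (FUN must accept an INDEX argument):

```r
yapply <- function(X, FUN, ...) {
  index <- seq(length.out = length(X))
  mapply(FUN, X, INDEX = index, MoreArgs = list(...))
}
yapply(list(a = 3, b = 6, c = 9), function(x, INDEX) x * INDEX)
# each element multiplied by its position: a=3, b=12, c=27
```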
Re: [R] model selection criteria in regsubsets
The calculations are in summary.regsubsets. Sending three copies of questions like this does not increase the chance of a response. -thomas On Thu, 21 Jun 2007, [EMAIL PROTECTED] wrote: Hi All, I used regsubsets in package leaps to do the model subset selection. I noticed the bic and cp criteria are both included in this function. But it seems like they are not calculated by bic = -n*log(RSS/n) - (p+1)*log(n) and cp = (RSS/sigma_hat^2) - (n-2*p-2) Could you please let me know what formulas are used for these two criteria? Thank you ! Linda
Re: [R] psm/survreg coefficient values ?
On Tue, 19 Jun 2007, John Logsdon wrote: In survreg() the predictor is log(characteristic life) for the Weibull (= exponential when scale=1) - ie the 63.2%ile. For the others the predictor is log(median). This causes problems when comparing predictions, and a better way IMHO is to correct the Weibull prediction by a factor (log(2))^(1/scale). This is only a simple multiple unless the shape parameter is also being modelled, when a completely different solution may arise. Such heterogeneity modelling cannot of course be done within survreg(). Except, of course, for a discrete predictor of heterogeneity, using strata(). -thomas On Monday 18 June 2007 22:56:54 Frank E Harrell Jr wrote: sj wrote: I am using psm to model some parametric survival data; the data is for length of stay in an emergency department. There are several ways a patient's stay in the emergency department can end (discharge, admit, etc.), so I am looking at modeling the effects of several covariates on the various outcomes. Initially I am trying to fit a survival model for each type of outcome using the psm function in the Design package, i.e., all patients whose visits come to an end due to any event other than the event of interest are considered to be censored. Being new to the psm and survreg packages (and to parametric survival modeling) I am not entirely sure how to interpret the coefficient values that psm returns. I have included the following code to illustrate code similar to what I am using on my data. I suppose that the coefficients are somehow rescaled, but I am not sure how to return them to the original scale and make sense out of the coefficients, e.g., estimate the effect of higher acuity on time to event in minutes. Any explanation or direction on how to interpret the coefficient values would be greatly appreciated. this is from the documentation for survreg.object: coefficients: the coefficients of the linear.predictors, which multiply the columns of the model matrix.
It does not include the estimate of error (sigma). The names of the coefficients are the names of the single-degree-of-freedom effects (the columns of the model matrix). If the model is over-determined there will be missing values in the coefficients corresponding to non-estimable coefficients. code: LOS <- sort(rweibull(1000,1.4,108)); AGE <- sort(rnorm(1000,41,12)); ACUITY <- sort(rep(1:5,200)); EVENT <- sample(x=c(0,1),replace=TRUE,1000); psm(Surv(LOS,EVENT)~AGE+as.factor(ACUITY),dist='weibull') output: psm(formula = Surv(LOS, CENS) ~ AGE + as.factor(ACUITY), dist = "weibull") Obs 1000, Events 513, Model L.R. 2387.62, d.f. 5, P 0, R2 0.91 Value Std. Error z p (Intercept) 1.1055 0.04425 24.98 8.92e-138 AGE 0.0772 0.00152 50.93 0.00e+00 ACUITY=2 0.0944 0.01357 6.96 3.39e-12 ACUITY=3 0.1752 0.02111 8.30 1.03e-16 ACUITY=4 0.1391 0.02722 5.11 3.18e-07 ACUITY=5 -0.0544 0.03789 -1.43 1.51e-01 Log(scale) -2.7287 0.03780 -72.18 0.00e+00 Scale = 0.0653 best, Spencer I have a case study using psm (survreg wrapper) in my book. Briefly, coefficients are on the log median survival time scale. Frank -- Best wishes John John Logsdon "Try to make things as simple as possible but not simpler" Quantex Research Ltd, Manchester UK [EMAIL PROTECTED] [EMAIL PROTECTED] +44(0)161 445 4951/G:+44(0)7717758675 www.quantex-research.com
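The back-transformation discussed above can be done without hand-computing the correction factor: predict.survreg with type = "quantile" returns quantiles on the original time scale. A sketch with simulated data, verifying the Weibull relation median = exp(linear predictor) * log(2)^scale:

```r
library(survival)
set.seed(3)
LOS   <- rweibull(1000, 1.4, 108)
EVENT <- rbinom(1000, 1, 0.7)        # crude independent censoring indicator
fit <- survreg(Surv(LOS, EVENT) ~ 1, dist = "weibull")
med <- predict(fit, type = "quantile", p = 0.5)[1]   # median on the time scale
all.equal(unname(med), unname(exp(coef(fit)) * log(2)^fit$scale))
```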
Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression
On Tue, 19 Jun 2007, spime wrote: Is there any windows version of the Design package??? Not at the moment. It is being updated for changes in R 2.5.0. [This would be a FAQ except that it should stop being asked soon] -thomas Frank E Harrell Jr wrote: spime wrote: I haven't found any helpful thread. How can I calculate the BIC and Hosmer-Lemeshow statistic for a logistic regression model? I have used glm for the logistic fit. See the Design package's lrm function and residuals.lrm for a better GOF test. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Re: [R] Error handling
This is FAQ 7.32 "How can I capture or ignore errors in a long simulation?" -thomas On Tue, 19 Jun 2007, Peter Sajosi wrote: Hello, I have a question about error handling. I run simulation studies and often the program stops with an error, for example during maximum likelihood. I would like the program not to stop but to continue, and I would like to ask how the error handling can be set up for this (if it can). I tried to look through manuals etc but unfortunately did not get closer to the solution. Below is a small example with some numbers, where the nlm function gives an error. Is it possible to make R and the program ignore the error? (there is a small for loop at the end of the example, which breaks - ideally the program would complete the for loop even though there are errors). Of course this is just an example; in the simulation study the error comes up quite rarely, but still it is annoying to handle it manually each time. Many thanks Peter The example: logfunc <- function (params) { vutil1 <- as.matrix(x2[,1:3]) %*% params; vutil2 <- as.matrix(x2[,4:6]) %*% params; logl <- 0; for (i in 1:6) { prob <- log((exp(vutil1[i])*achoices[i,1]+exp(vutil2[i])*achoices[i,2])/(exp(vutil1[i])+exp(vutil2[i]))); logl <- logl + prob }; return(-logl) } x2 <- array(c(0,4,1,3,5,3,3,2,1,4,1,2,0,2,2,1,1,4, 1.2233310, 0.000, 0.8155540, 0.9320617, 1.4272195, 1.8349965, 0.6116655, 3.2622160, 0.8155540, 3.7282469, 0.000, 4.5874913, 0.6116655, 3.2622160, 1.6311080, 1.8641235, 4.2816586, 0.9174983), dim=c(6,6)) achoices <- array(c(1,0,1,0,1,0,0,1,0,1,0,1), dim=c(6,2)) for (k in 1:5) { nlm(logfunc, c(1,1,1), print.level=2) } --- Thanks!!! --- [[alternative HTML version deleted]]
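FAQ 7.32 in practice: wrap the failing call so the loop always completes. The objective below is a simple stand-in, not the poster's logfunc; failed iterations would simply leave a NULL behind:

```r
fits <- vector("list", 5)
for (k in 1:5) {
  fits[[k]] <- tryCatch(nlm(function(p) sum((p - k)^2), c(1, 1, 1)),
                        error = function(e) NULL)   # swallow the error, keep going
}
sum(!sapply(fits, is.null))   # iterations that succeeded
```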
Re: [R] importing .dta files
On Fri, 15 Jun 2007, Chris Linton wrote: I'm trying to read in a Stata file but I've never used this function (read.dta). It's the only one that seems to come close to working, but I keep getting this error: data <- read.dta("C:/Documents and Settings/Chris/Desktop/S4412/catestscores.dta") Error in read.dta("C:/Documents and Settings/Chris/Desktop/S4412/catestscores.dta") : a binary read error occurred There's little chance the data is corrupt considering it came from my professor and he used the data earlier. So, either I'm doing something wrong or R just doesn't like to read in Stata files. If it's a problem with R, how can I easily convert the file without purchasing Stata? R does read Stata files -- I use this facility frequently. It's hard to tell why it isn't working in your case, since we don't know anything about the file, your version of R, version of Stata, etc (we can guess you are on Windows from the file name). The error message implies that the file was found, and that it started with the right sequence of bytes to be a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it was expecting to read. This is why (in the absence of any further information) the natural suspicion is that the file is corrupt. It is possible that we have misunderstood some unusual possibility in the Stata file format -- this has happened once before -- but it is fairly well documented. In any case, there is not much that can be done without more information. -thomas
Re: [R] export to a dat file that SAS can read
On Wed, 13 Jun 2007, Uwe Ligges wrote: Rina Miehs wrote: Hello i have a data frame in R that some SAS users need to use in their programs, and they want it in a dat file, is that possible? What is a dat file? and which functions to use for that? I *guess* write.table() will do the trick, given "dat" is what I guess it is... Another approach (that preserves factor levels if you have them) is to use write.foreign in the 'foreign' package. This writes a text file for the data and also writes SAS code to read the file. -thomas
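The write.foreign route sketched out (file names and the toy data frame are invented; the .dat text file and the generated SAS program are both handed to the SAS users):

```r
library(foreign)
df <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))
datafile <- file.path(tempdir(), "dataset.dat")   # the text data file
codefile <- file.path(tempdir(), "dataset.sas")   # SAS code that reads it back
write.foreign(df, datafile = datafile, codefile = codefile, package = "SAS")
file.exists(datafile)   # the factor levels survive via the generated SAS formats
```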
Re: [R] Looking for R-code for non-negative matrix factorization in the presence of Gaussian or Poisson noise
On Mon, 11 Jun 2007, [EMAIL PROTECTED] wrote: Hi all, Has any of you implemented code for non-negative matrix factorization to solve Y = T P' + E; dim(Y)=(n,p); dim(T)=(n,nc); dim(P)=(p,nc); dim(E)=(n,p) where T and P must be non-negative and E either Gaussian or Poisson noise. I'm looking for two variants: 1. Easy (I think), T is known (that is, we just want to solve the general inverse problem) This is non-negative least squares, a quadratic programming problem, for which there is code (at least if n and nc are not too big) 2. Harder (?), T is unknown (under some restrictions) [as an intermediate, we may want to fix nc] Even with fixed nc this is Distinctly Non-trivial. It often isn't identifiable, for a start. I've encountered this problem in air pollution source apportionment, where people use an algorithm due to Paatero (1999) JCGS 8:854-8, which is a conjugate gradient algorithm that handles the constraints by creative abuse of preconditioning. The algorithm seems fairly well-behaved, although the statistical properties of the estimates are not well-understood [at least, I don't understand them, and I have simulations that appear to contradict the views of people who claim to understand them]. The difficulty probably depends on the size of the problem -- the air pollution problems have n~1000, p~20, nc~7, or larger. -thomas
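For variant 1 (T known, Gaussian E), each row of P can be recovered by non-negative least squares, one column of Y at a time. A sketch assuming the CRAN 'nnls' package:

```r
library(nnls)
set.seed(4)
Tm <- matrix(runif(50), 10, 5)                     # n = 10, nc = 5, known
Pm <- matrix(runif(15), 3, 5)                      # p = 3, true non-negative loadings
Y  <- Tm %*% t(Pm) + matrix(rnorm(30, sd = 0.01), 10, 3)  # Y = T P' + Gaussian E
P_hat <- t(apply(Y, 2, function(y) nnls(Tm, y)$x))  # NNLS column by column
max(abs(P_hat - Pm))   # small; every entry of P_hat is >= 0 by construction
```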
Re: [R] name of the variable that will contain the result of a function
On Wed, 6 Jun 2007, Benilton Carvalho wrote: Hi everyone, say I have a function called 'foo', which takes the argument arg1. Is there any mechanism that I can use to learn about the variable where foo(arg1) is going to be stored? No. This information isn't available explicitly even at the C level. For example: x <- foo(arg1) so, inside foo() I'd like to be able to get the string "x". if foo(arg1) was used instead, I'd like to get NA. It could be much worse than this, for example: x[[7]][y][[4]] <- foo(arg1) w <- foo(arg2)+1 names(x)[foo(arg3)] <- foo(arg4) -thomas
Re: [R] summing up column values for unique IDs when multiple IDs exist in a data frame
On Tue, 29 May 2007, Seth Falcon wrote:

Young Cho [EMAIL PROTECTED] writes: I have data.frames with IDs and multiple columns. Because some of the IDs showed up more than once, I need to sum up column values to create a new data frame with unique IDs. I hope there are some cheaper ways of doing it... Because the data frame is huge, it takes almost an hour to do the task. Thanks so much in advance!

Does this do what you want in a faster way?

rowsum() should probably be faster (but perhaps not much).

-thomas

Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle
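rowsum() sums the rows of numeric columns within groups in compiled code, which is why it tends to be fast on large data frames. A small sketch with illustrative names:

```r
## toy data frame with repeated IDs
df  <- data.frame(id = c("a", "b", "a", "c", "b"), v1 = 1:5, v2 = 6:10)

## one row per unique id, with the numeric columns summed within id
agg <- rowsum(df[c("v1", "v2")], group = df$id)
```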
Re: [R] normality tests [Broadcast]
On Mon, 28 May 2007, Martin Maechler wrote:

"LuckeJF" == Lucke, Joseph F [EMAIL PROTECTED] on Fri, 25 May 2007 12:29:49 -0500 writes:

LuckeJF> Most standard tests, such as t-tests and ANOVA, are fairly
LuckeJF> resistant to non-normality for significance testing. It's the
LuckeJF> sample means that have to be normal, not the data. The CLT
LuckeJF> kicks in fairly quickly.

Even though such statements appear in too many (text)books, that's just plain wrong practically: even though the *level* of the t-test is resistant to non-normality, the power is not at all! And that makes the t-test NON-robust!

While it is true that this makes the t-test non-robust, it doesn't mean that the statement is "just plain wrong practically". The issue really is more complicated than a lot of people claim (not you specifically, Martin, but upthread and in previous threads). Starting with the demonstrable mathematical facts:

- lots of rank tests are robust in the sense of Huber
- rank tests are optimal for specific location-shift testing problems
- lots of rank tests have excellent power for location-shift alternatives over a wide range of underlying distributions
- rank tests fail to be transitive when stochastic ordering is not assumed (they are not consistent with any ordering on all distributions)
- rank tests do not lead to confidence intervals unless a location shift or similar one-dimensional family is assumed
- no rank test is uniformly more powerful than any parametric test, or vice versa (if we rule out pathological cases)
- there is no rank test that is consistent precisely against a difference in means
- the t-test (and essentially all tests) can be made distribution-free in large samples (for small values of 'large', usually)
- being distribution-free does not guarantee robustness of power (for the t-test or for any other test)

Now, if we assume stochastic ordering, is the Wilcoxon rank-sum test more or less powerful than the t-test?
Everyone knows that this depends on the null hypothesis distribution. Fewer people seem to know that it also depends on the alternative, especially in large samples. Suppose the alternative of interest is not that the values are uniformly larger by 1 unit, but that 5% of them are about 20 units larger. The Wilcoxon test -- precisely because it gives less weight to outliers -- will have lower power. For example (ObR):

one.sim <- function(n, pct, delta) {
    x <- rnorm(n)
    y <- rnorm(n) + delta * rbinom(n, 1, pct)
    list(x = x, y = y)
}
mean(replicate(100, {d <- one.sim(100, .05, 20); t.test(d$x, d$y)$p.value}) < 0.05)
mean(replicate(100, {d <- one.sim(100, .05, 20); wilcox.test(d$x, d$y)$p.value}) < 0.05)
mean(replicate(100, {d <- one.sim(100, .5, 1); t.test(d$x, d$y)$p.value}) < 0.05)
mean(replicate(100, {d <- one.sim(100, .5, 1); wilcox.test(d$x, d$y)$p.value}) < 0.05)

Since both relatively uniform shifts and large shifts of small fractions are genuinely important alternatives in real problems, it is true in practice as well as in theory that neither the Wilcoxon nor the t-test is uniformly superior. This is without even considering violations of stochastic ordering -- which are not just esoteric pathologies, since it is quite plausible for a treatment to benefit some people and harm others. For example, I've seen one paper in which a Wilcoxon test on medical cost data was statistically significant in the *opposite direction* to the difference in means.

This has been a long rant, but I keep encountering statisticians who think anyone who ever recommends a t-test just needs to have the number 0.955 quoted to them.

snip

LuckeJF> Testing for normality prior to choosing a test statistic is
LuckeJF> generally not a good idea.

Definitely. Or even: it's a very bad idea...

I think that's something we can all agree on.
-thomas
Re: [R] convergence of coxfilter and coxph
On Mon, 21 May 2007, carol white wrote:

Hi, the coxfilter function in the genefilter package uses coxph to fit a model to filter genes. How come coxfilter could converge to find a solution in Cox model fitting using a data matrix of 8000 variables and 600 samples, but coxph doesn't converge with the same matrix?

coxfilter() fits 8000 one-variable models, which works (for appropriate values of "works"). coxph() refuses to fit one 8000-variable model.

-thomas
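A toy illustration of the distinction, using the recommended survival package (assumed available) with sizes shrunk and simulated data standing in for the real matrix:

```r
library(survival)   # recommended package shipped with R
set.seed(1)
n <- 100; G <- 5    # tiny stand-in for 600 samples x 8000 genes
X <- matrix(rnorm(n * G), n, G)
tm <- rexp(n); status <- rbinom(n, 1, 0.7)

## coxfilter-style screening: G separate one-variable fits, each easy
pvals <- apply(X, 2, function(x)
  summary(coxph(Surv(tm, status) ~ x))$coefficients[, "Pr(>|z|)"])

## whereas coxph(Surv(tm, status) ~ X) would be one G-variable joint fit,
## which for G >> n cannot be estimated
```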
Re: [R] Selecting complementary colours
On Mon, 21 May 2007, John Fox wrote:

In retrospect, I didn't specify the problem clearly: what I want to be able to do is to place text on a background of arbitrary (but known RGB) colour so that the text is legible. I guess that this is better described as a contrasting than a complementary colour.

Since luminance contrasts are necessary and sufficient for readable text, you could use white for dark colors and black for light colors. Luminance is roughly proportional to 0.2*(R^2.4) + 0.6*(G^2.4), suggesting something like

lightdark <- function(color) {
    rgb <- col2rgb(color)/255
    L <- c(0.2, 0.6, 0) %*% rgb
    ifelse(L >= 0.2, "#000060", "#FFFFA0")
}

This uses a pale yellow for dark backgrounds and a dark blue for light backgrounds, and it seems to work reasonably well.

-thomas
Re: [R] svychisq
On Fri, 18 May 2007, Moss, Angela (Dudley PCT) wrote:

Dear All, I am trying to use svychisq with a two-dimensional table, 4 x 5. The command I am using is

summary(svytable(~dietperception + dietstatus, dudleyls1rake, na.rm = TRUE), "Chisq")

It is throwing up an error message as follows:

Error in NCOL(y) : only 0's may be mixed with negative subscripts

I can't reproduce this problem at all. I've tried tables with zero cells, with and without raking. The na.rm= argument to svytable() can't be helping, since svytable() doesn't have an na.rm argument. Does the same thing happen if you call svychisq() directly rather than via summary(svytable())?

-thomas
Re: [R] saving data frame object problem
On Tue, 22 May 2007, [EMAIL PROTECTED] wrote:

Do I miss something here?

dtaa = read.table("http://www.ats.ucla.edu/stat/mplus/examples/ma_snijders/mlbook1.dat", sep = ",")
head(dtaa)  # shows the data as it should be
save(dtaa, file = "c:/dtaa")
d = load("c:/dtaa")

Yes. From ?load:

Value: A character vector of the names of objects created, invisibly.

So d is correct. Try ls() to find the loaded data.

-thomas
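A self-contained round trip showing the same point with a temporary file: load() restores the object under its original name in the calling environment and returns only the *names* of what it restored.

```r
f <- tempfile()
dtaa <- data.frame(a = 1:3)
save(dtaa, file = f)
rm(dtaa)          # dtaa is gone ...
d <- load(f)      # ... load() recreates it; d is the character name "dtaa"
```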
Re: [R] lapply not reading arguments from the correct environment
On Fri, 18 May 2007, jiho wrote:

Hello, I am facing a problem with lapply which I think may be a bug. This is the most basic function in which I can reproduce it:

myfun <- function() {
    foo = data.frame(1:10, 10:1)
    foos = list(foo)
    fooCollumn = 2
    cFoo = lapply(foos, subset, select = fooCollumn)
    return(cFoo)
}

snip I get this error:

Error in eval(expr, envir, enclos) : object "fooCollumn" not found

while fooCollumn is defined, in the function, right before lapply. snip This is with R 2.5.0 on both OS X and Linux (Fedora Core 6). What did I do wrong? Is this indeed a bug? An intended behavior?

No, it isn't a bug (though it may be confusing). The problem is that subset() evaluates its select argument in an unusual way. Usually the argument would be evaluated inside myfun() and the value passed to lapply(), and everything would work as you expect. subset() bypasses the normal evaluation and explicitly evaluates the select argument in the calling frame, i.e., inside lapply(), where fooCollumn is not visible. You could do

lapply(foos, function(foo) subset(foo, select = fooCollumn))

capturing fooCollumn by lexical scope. In R this is often a better option than passing extra arguments to lapply (or other functions that take function arguments).

-thomas
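The suggested fix can be checked directly: the anonymous function captures fooCollumn by lexical scope, so subset()'s nonstandard evaluation of select= finds it without needing it in lapply's frame.

```r
foo <- data.frame(a = 1:10, b = 10:1)
foos <- list(foo)
fooCollumn <- 2
## works: fooCollumn is visible through the anonymous function's environment
cFoo <- lapply(foos, function(f) subset(f, select = fooCollumn))
```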
Re: [R] Scoped options setting?
On Wed, 16 May 2007, Zack Weinberg wrote:

Is there any way to set options during the evaluation of a particular expression, with them automatically reset when control leaves that expression, however that happens? Kind of like let on a special variable does in Lisp. I naively tried ...

You could write with_options() as

with_options <- function(optionlist, expr) {
    oldoptions <- options(optionlist)
    on.exit(options(oldoptions))
    eval(substitute(expr), parent.frame())
}

and then do

with_options(list(warn = -1), whatever)

-thomas
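A self-contained check of the with_options() helper sketched above: the option takes effect only inside the expression, and on.exit() restores it even if the expression errors.

```r
with_options <- function(optionlist, expr) {
  oldoptions <- options(optionlist)
  on.exit(options(oldoptions))       # restore on any exit, error included
  eval(substitute(expr), parent.frame())
}

old    <- getOption("digits")
inside <- with_options(list(digits = 3), getOption("digits"))
after  <- getOption("digits")        # back to the old value
```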
Re: [R] [OT] Is data copyrightable?
This is an area where US law differs importantly from other countries. US law protects compilations of facts only to the extent that the selection of the facts is creative expression (and does not protect the facts themselves). Many other jurisdictions (eg the European Union) also offer protection based on the effort needed to compile the facts, regardless of any creativity. A 1991 US Supreme Court decision (in a case about telephone directories) ruled that the 'sweat of the brow' rationale for copyright was inconsistent with the intellectual property clause of the US Constitution. So, in the US, it depends on the data and their source. Publishers that I have talked to tend to claim that data are definitely copyrightable, but since they tend to own the copyrights one might do well to recall the immortal words of Mandy Rice-Davies.

-thomas

On Sat, 12 May 2007, hadley wickham wrote:

Dear all, this is a little bit off-topic, but I was wondering if anyone has any informed opinion on whether data (ie. a dataset) is copyrightable? Hadley
Re: [R] PRESS criterion in leaps
On Fri, 11 May 2007, Andrew Smith wrote:

I thought it would be simplest to build on already existing functions like regsubsets in package leaps. It's easy enough to calculate the PRESS criterion for a fitted lm object, but I'm having trouble deciphering the structure of the regsubsets objects that leaps works with. Is there any way to calculate PRESS from a regsubsets object? Or, to put it another way, can I get the residual vector and the diagonal entries of the hat matrix from a regsubsets object? In fact, if the hat matrix is never calculated explicitly, the columns of Q from the QR factorization would suffice.

Not only is the hat matrix never calculated explicitly, the Q matrix isn't calculated either. The code forms R and Q^T Y directly (the same code is used in the biglm package to provide bounded-space linear regression).

-thomas
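For a fitted lm object, the "easy" PRESS calculation mentioned above uses the standard identity PRESS = sum((e_i / (1 - h_ii))^2), with h_ii the hat-matrix diagonal. A sketch using mtcars as a convenient stand-in dataset:

```r
fit   <- lm(mpg ~ wt + hp, data = mtcars)
h     <- lm.influence(fit, do.coef = FALSE)$hat   # hat diagonals
press <- sum((residuals(fit) / (1 - h))^2)        # leave-one-out SSE
```

The identity can be verified against an explicit leave-one-out loop, which is exactly what makes it attractive: no refitting is needed.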
Re: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate
On Tue, 8 May 2007, Frank E Harrell Jr wrote:

Does anyone know of an R function for computing the Greenland-Robins variance for Mantel-Haenszel relative risks?

Both of the (almost identical) meta-analysis packages compute the MH estimator for relative risks and its standard error.

-thomas
Re: [R] power 2x3 exact test
On Thu, 10 May 2007, [EMAIL PROTECTED] wrote:

Given that you expect some cells to be small, it should not be a severe task to draw up a list of (a1,b1) values which correspond to rejection of the null hypothesis (that both ORs equal 1), and then the simulation using different values of the two odds-ratios will give you the power for each such pair of odds-ratios. The main technical difficulty will be simulation of random tables, conditional on the marginals, with the probabilities as given above. I don't know of a good suggestion for this.

r2dtable(). If this is a power calculation, though, you probably want to fix only one margin, which is a much simpler problem, and if the table is not too large it would not be difficult to compute the exact probability for each element of the sample space and so get exact power.

-thomas
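r2dtable() does exactly the conditional simulation described: it draws independent r x c tables with both margins fixed. A small sketch with illustrative margins:

```r
set.seed(1)
## three random 2x2 tables with row sums (10, 15) and column sums (12, 13)
tabs <- r2dtable(3, r = c(10, 15), c = c(12, 13))
```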
Re: [R] reference in article
On Wed, 2 May 2007, Tomas Mikoviny wrote:

Hi all R positive, does anyone know how to refer to R in an article?

Every time you start R it says (in part):

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

-thomas
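citation() returns a structured citation object that can be printed, or converted to BibTeX for use in a manuscript:

```r
cit <- citation()   # the recommended citation for R itself
bib <- toBibtex(cit)  # BibTeX entry, ready to paste into a .bib file
```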
Re: [R] GLS terminology question not related to R
On Wed, 25 Apr 2007, Leeds, Mark (IED) wrote:

This is a terminology question not related to R. The literature often says that OLS is inefficient relative to GLS if the residuals in the system are correlated (and the RHS of each equation is not identical). Does this mean that OLS overestimates residual and coefficient variances, underestimates them, or just gets them wrong with the direction unknown? Thanks.

It means neither. It means that the true variance of the OLS estimates is greater than the true variance of the GLS estimates. A separate issue is whether the estimated variance of an OLS estimator is greater or less than the true variance of the OLS estimator. This can go either way.

-thomas
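The distinction can be shown by simulation. In this sketch (illustrative setup: heteroskedastic errors with known variances, so GLS is just weighted least squares done by hand), both estimators are unbiased for the slope, but the OLS estimates have the larger true sampling variance:

```r
set.seed(2)
n <- 200
x <- rnorm(n)
v <- exp(x)                     # known error variances
X <- cbind(1, x)
est <- replicate(500, {
  y <- 1 + 2 * x + rnorm(n, sd = sqrt(v))
  b_ols <- solve(crossprod(X), crossprod(X, y))[2]
  XtVi  <- t(X / v)             # X' V^{-1}, with V = diag(v)
  b_gls <- solve(XtVi %*% X, XtVi %*% y)[2]
  c(ols = b_ols, gls = b_gls)
})
```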
Re: [R] general question about plotting multiple regression results
On Thu, 19 Apr 2007, Simon Pickett wrote:

Hi all, I have been bumbling around with R for years now and still haven't come up with a solution for plotting reliable graphs of relationships from a linear regression.

termplot() does this for a range of regression models (without interaction terms). The effects package does it better for linear regression models.

-thomas

Here is an example illustrating my problem. 1. I do a linear regression as follows

summary(lm(n.day13 ~ n.day1 + ffemale.yell + fmale.yell + fmale.chroma, data = surv))

which gives some nice sig. results

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.73917    0.43742  -1.690 0.093069 .
n.day1        1.00460    0.05369  18.711  < 2e-16 ***
ffemale.yell  0.22419    0.06251   3.586 0.000449 ***
fmale.yell    0.25874    0.06925   3.736 0.000262 ***
fmale.chroma  0.23525    0.11633   2.022 0.044868 *

2. I want to plot the effect of ffemale.yell, fmale.yell and fmale.chroma on my response variable. So I either plot the raw values (which is fine when there is a very strong relationship), but what if I want to plot the effects from the model? In this case I would usually plot the fitted values against the raw values of x... Is this the right approach?

fit <- fitted(lm(n.day13 ~ n.day1 + ffemale.yell + fmale.yell + fmale.chroma, data = fsurv1))
plot(fit ~ ffemale.yell)
# make a dummy variable across the range of x
x <- seq(from = min(fsurv1$ffemale.yell), to = max(fsurv1$ffemale.yell), length = 100)
# get the coefficients and draw the line
co <- coef(lm(fit ~ ffemale.yell, data = fsurv1))
y <- (co[2] * x) + co[1]
lines(x, y, lwd = 2)

This often does the trick but for some reason, especially when my model has many terms in it or when one of the independent variables is only significant when the other independent variables are in the equation, it gives me strange lines. Please can someone show me the light? Thanks in advance, Simon.
Simon Pickett, PhD student, Centre For Ecology and Conservation, Tremough Campus, University of Exeter in Cornwall, TR109EZ. Tel 01326371852
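The termplot() suggestion above can also be used non-graphically: with plot = FALSE it returns the per-term partial-effect data, which you can then plot however you like (mtcars stands in here for the poster's data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
## one data frame per model term, with x values and the term's contribution
terms_data <- termplot(fit, se = TRUE, plot = FALSE)
```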
Re: [R] help comparing two median with R
On Tue, 17 Apr 2007, Robert McFadden wrote:

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon. Sent: Tuesday, April 17, 2007 12:37 PM. To: Pedro A Reche. Cc: r-help@stat.math.ethz.ch. Subject: Re: [R] help comparing two median with R

Pedro A Reche wrote: Dear R users, I am new to R and I would like to ask your help with the following topic. I have three sets of numerical data, two sets are paired and a third is independent of the other two. For each of these sets I have obtained their basic statistics (mean, median, stdv, range...). Now I want to compare whether these sets differ. I could compare the means with a basic t test. However, I was looking for a test to compare the medians using R. If that is possible I would love to hear the specifics.

Hi Pedro, you can use the Mann-Whitney test (wilcox with two samples), but you would have to check that the second and third moments of the variable distributions were the same, I think. Jim

Use the Mann-Whitney U test, but remember two assumptions: 1. the samples come from a continuous distribution (there are no tied observations); 2. the distributions are identical in shape. It's very similar to the t-test, but the Mann-Whitney U test is not as affected by violation of the homogeneity-of-variance assumption as the t-test is.

This turns out not to be quite correct. If the two distributions differ only by a location shift then the hypothesis that the shift is zero is equivalent to the medians being the same (or the means, or the 3.14159th percentile), and the Mann-Whitney U test will test this hypothesis. Otherwise the Mann-Whitney U test does not test for equal medians. The assumption that the distributions are continuous is for convenience -- it makes the distribution of the test statistic easier to calculate; otherwise R uses an approximation.
The assumption of a location shift is critical -- otherwise it is easy to construct three data sets x, y, z so that the Mann-Whitney U test thinks x is larger than y, y is larger than z, and z is larger than x (Google for "Efron dice"). That is, the Mann-Whitney U test cannot be a test for any location statistic.

There actually is an exact test for the median that does not assume a location shift: dichotomize your data at the pooled median to get a 2x2 table of above/below median by group, and do Fisher's exact test on the table. This is almost never useful (because it doesn't come with an interval estimate), but is interesting because it (and the generalizations to other quantiles) is the only exactly distribution-free location test that does not have the 'non-transitivity' problem of the Mann-Whitney U test. I believe this median test is attributed to Mood, but I have not seen the primary source.

-thomas
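The median test described above takes only a few lines (illustrative data; in practice you would of course use your own two samples):

```r
set.seed(3)
x <- rexp(40)
y <- rexp(40) + 0.5
m <- median(c(x, y))                 # pooled median
## 2x2 table of group by above/below the pooled median
tab <- table(group = rep(c("x", "y"), each = 40), above = c(x, y) > m)
ft <- fisher.test(tab)               # Fisher's exact test on the table
```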
Re: [R] help comparing two median with R
On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:

The points that Thomas and Brian have made are certainly correct, if one is truly interested in testing for differences in medians or means. But the Wilcoxon test provides a valid test of x > y more generally. The test is consonant with the Hodges-Lehmann estimator: the median of all possible differences between an X and a Y.

Yes, but there is no ordering of distributions (taken one at a time) that agrees with the Wilcoxon two-sample test, only orderings of pairs of distributions. The Wilcoxon test provides a test of x > y if it is known a priori that the two distributions are stochastically ordered, but not under weaker assumptions. Otherwise you can get x > y > z > x. This is in contrast to the t-test, which orders distributions (by their mean) whether or not they are stochastically ordered.

Now, it is not unreasonable to say that the problems are unlikely to occur very often and aren't worth worrying too much about. It does imply that it cannot possibly be true that there is any summary of a single distribution that the Wilcoxon test tests for (and the same is true for other two-sample rank tests, eg the logrank test). I know Frank knows this, because I gave a talk on it at Vanderbilt, but most people don't know it. (I thought for a long time that the Wilcoxon rank-sum test was a test for the median pairwise mean, which is actually the R-estimator corresponding to the *one*-sample Wilcoxon test.)

-thomas
Re: [R] software comparison
On Mon, 16 Apr 2007, Philippe Grosjean wrote:

Hello, I just read this paper, and was surprised by: "This study performs a comparison of the latest versions of nine different statistical software packages using StRD. These packages are SAS (9.1), SPSS (12.0), Excel (2003), Minitab (14.0), Stata (8.1), Splus (6.2), R (1.9.1), JMP (5.0), and StatCrunch (3.0)." For a paper published in 2007, and submitted in April 2005, this is still surprising. If my calculation is correct, in 2004 they would have used R 2.2.x, or something... not 1.9.1?

No -- there is a new x.y.0 release twice per year. Checking the r-announce archives shows that 2.0.0 came out in October 2004 and 1.9.1 in June 2004. Describing 1.9.1 as the latest version was inaccurate, but it may well have been a more recent version than for most of the other packages they examined.

-thomas
Re: [R] read.spss (package foreign) and SPSS 15.0 files
On Thu, 5 Apr 2007, John Kane wrote:

Heck, I'd be happy to get an answer to what is happening here:

mac <- spss.get("H:/ONTH/Raw.data/Follow.sav")
Warning message: H:/ONTH/Raw.data/Follow.sav: Unrecognized record type 7, subtype 16 encountered in system file

It means that your file had a record of type 7, subtype 16 in it, and read.spss doesn't know how to handle these. You would have to ask SPSS what record type 7 and subtype 16 represent -- their software put them there, and it's their terminology. People's experience with unrecognised record types is that they usually don't matter, which would make sense from a backwards-compatibility point of view, but in the absence of documentation or psychic powers it is hard to be sure. Avoiding read.spss is a perfectly reasonable strategy, and is in fact what we have always recommended in the Data Import/Export manual. AFAIK the only commercial statistical software vendor that does provide complete, public documentation of their file formats is Stata, and this is one reason why there are fewer complaints about read.dta and write.dta. It also probably helps that the code was written by someone who uses Stata -- there hasn't been much contribution of code or patches for the foreign package from SPSS users.

-thomas
Re: [R] Using Sampling Weights in R
On Tue, 10 Apr 2007, Thomas W. Volscho wrote:

Dear List, I have a dataset that provides sampling weights (National Survey of Family Growth 2002). I want to produce a cross-tabulation and use the provided sampling weights to obtain representative population estimates. (I believe they are simply frequency weights but the codebook is uninformative.)

They are almost certainly not simply frequency weights -- the NCHS web page on this survey describes a multistage sampling scheme and gives code examples for survey software using the design features. If you only want point estimates and no intervals or p-values then it doesn't matter what type of weights they are.

I can reproduce results (using this data) that were reported in a recent journal article if I use SPSS or Stata, specifying the weight.

The survey package does analysis of surveys of this sort. Judging from the Stata example at http://www.cdc.gov/nchs/data/nsfg/Ser2_Example1_FINAL.pdf, if the data are in a data frame called 'nsfg' you would create a survey design object with

dnsfg <- svydesign(id = ~SECU_R, weight = ~FINALWGT, strata = ~SEST, data = nsfg)

and then you could get crosstabs with, e.g.,

svytable(~agerx + pill, design = dnsfg)

-thomas
Re: [R] Reasons to Use R
On Wed, 11 Apr 2007, Alan Zaslavsky wrote:

I have thought for a long time that a facility for efficient rowwise calculations might be a valuable enhancement to S/R. The storage of the object would be handled by a database and there would have to be an efficient interface for pulling a row (or small chunk of rows) out of the database repeatedly; alternatively the operations could be conducted inside the database. Basic operations of rowwise calculation and cumulation (such as forming a column sum or a sum of outer products) would be written in an R-like syntax and translated into an efficient set of operations that work through the database. (Would be happy to share some jejune notes on this.) However, the main answer to this problem in the R world seems to have been Moore's Law. Perhaps somebody could tell us more about the S-Plus large objects library, or the work that Doug Bates is doing on efficient calculations with large datasets.

I have been surprised to find how much you can get done in SQL, only transferring summaries of the data into R. There is soon going to be an experimental surveyNG package that works with survey data stored in a SQLite database without transferring the whole thing into R for most operations (and I could get further if SQLite had the log() and exp() functions that most other SQL implementations for large databases provide). I'll be submitting a paper on this to useR 2007.

The approach of transferring blocks of data into R and using a database just as backing store will allow more general computation, but will be less efficient than performing the computation in the database, so a mixture of both is likely to be helpful. Moore's Law will settle some issues, but there are problems where it is working to increase the size of datasets just as fast as it increases computational power.

-thomas
Re: [R] read.spss (package foreign) and SPSS 15.0 files
On Thu, 5 Apr 2007, John Kane wrote: Heck. I'd be happy to get an answer to what is happening here:

mac <- spss.get("H:/ONTH/Raw.data/Follow.sav")
Warning message:
H:/ONTH/Raw.data/Follow.sav: Unrecognized record type 7, subtype 16 encountered in system file

It means that your file had a record of type 7, subtype 16 in it, and read.spss doesn't know how to handle these. You would have to ask SPSS what record type 7 and subtype 16 represent -- their software put them there, and it's their terminology. People's experience with unrecognised record types is that they usually don't matter, which would make sense from a backwards-compatibility point of view, but in the absence of documentation or psychic powers it is hard to be sure. Avoiding read.spss is a perfectly reasonable strategy, and is in fact what we have always recommended in the Data Import/Export manual. AFAIK the only commercial statistical software vendor that does provide complete, public documentation of their file formats is Stata, and this is one reason why there are fewer complaints about read.dta and write.dta. It also probably helps that the code was written by someone who uses Stata -- there hasn't been much contribution of code or patches for the foreign package from SPSS users. -thomas
Re: [R] unexpected behavior when creating a list of functions
On Wed, 4 Apr 2007, Matthew Suderman wrote: I wanted to create a list of functions whose output differs depending on the value of a variable when the function was created. Generally this did not work. Each function was exactly the same, as in the simple example below:

get_data_function <- function(v) { function() { print(v) } }
data_functions <- sapply(1:10, function(v) get_data_function(v))
(data_functions[[1]])() # prints 10!

However, if I insert a statement in get_data_function to print argument v, then I get the different functions that I wanted:

get_data_function <- function(v) {
  print(v)
  function() { print(v) }
}
data_functions <- sapply(1:10, function(v) get_data_function(v))
(data_functions[[1]])() # prints 1, as expected!

I have two questions about this: * Is this a bug in R? No, it's lazy evaluation at work. * Is there a more direct way to get the effect of printing v? The function force() is designed for this -- it doesn't do anything special that any other evaluation wouldn't do, but it makes clear that all you are doing is forcing evaluation. -thomas
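The lazy-evaluation trap and the force() fix can be seen in a minimal sketch (the function names here are invented for illustration):

```r
## without force(): every closure created in the loop reads the shared
## loop variable i, which has reached 3 by the time any closure is called
fns <- list()
for (i in 1:3) fns[[i]] <- function() i
fns[[1]]()   # 3, not 1

## with force(): the factory evaluates its argument at creation time,
## so each closure captures its own value of v
make_fn <- function(v) { force(v); function() v }
fns2 <- lapply(1:3, make_fn)
fns2[[1]]()  # 1
```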
Re: [R] strange fisher.test result
On Mon, 2 Apr 2007, [EMAIL PROTECTED] wrote: From the above, the marginal totals for his 2x2 table

  a b  =  16  8
  c d     15 24

are (rows then columns) 24, 39, 31, 32. These fixed marginals mean that the whole table is determined by the value of a. The following function P.FX() computes the probabilities of all possible tables, conditional on the marginal totals (it is much more transparent than the code for the same purpose in fisher.test()):

As this example has shown, 2x2 tables are a nice opportunity for illustrating how the ordering of the sample space affects inference (because you can actually see the whole sample space). I used something like this as a term project in an introductory R class, where we wrote code to compute the probabilities for all outcomes conditional on one margin, and used this to get (conservative) exact versions of all the popular tests in 2x2 tables. It's interesting to do things like compare the rejection regions and power under various alternatives for the exact versions of the likelihood ratio test and Fisher's test. We didn't get as far as confidence intervals, but the code is at http://faculty.washington.edu/tlumley/b514/exacttest.R with .Rd files at http://faculty.washington.edu/tlumley/b514/man/ [credits: this is all based on ideas from Scott Emerson] -thomas
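The conditional enumeration described above (all tables with the given margins) can be sketched directly with dhyper(); this is not the thread's P.FX() code, just an illustration using the same table (a=16, b=8, c=15, d=24; margins 24, 39, 31, 32):

```r
## 2x2 table with fixed margins: rows 24 and 39, first column 31
a <- 16
r1 <- 24; r2 <- 39; c1 <- 31

## with all margins fixed, the table is determined by a, and its
## conditional distribution is hypergeometric
avals <- max(0, c1 - r2):min(c1, r1)
p <- dhyper(avals, r1, r2, c1)
sum(p)   # the whole conditional sample space: probabilities sum to 1

## two-sided Fisher p-value: sum over tables no more probable than observed
p.obs <- dhyper(a, r1, r2, c1)
sum(p[p <= p.obs * (1 + 1e-7)])
```

The small relative tolerance guards against floating-point ties when comparing probabilities, the same device fisher.test() uses internally.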
Re: [R] Tail area of sum of Chi-square variables
On Thu, 29 Mar 2007, Klaus Nordhausen wrote: I was wondering if there are any R functions that give the tail area of a sum of chisquare distributions of the type: a_1 X_1 + a_2 X_2 where a_1 and a_2 are constants and X_1 and X_2 are independent chi-square variables with different degrees of freedom. pchisqsum() in the survey package does this. -thomas
Re: [R] Tail area of sum of Chi-square variables
The Satterthwaite approximation is surprisingly good, especially in the most interesting range in the right tail (say 0.9 to 0.999). There is also another, better, approximation with a power of a chi-squared distribution that has been used in the survey literature. However, since it is easy to write down the characteristic function and perfectly feasible to invert it by numerical integration, we might as well use the right answer. -thomas On Thu, 29 Mar 2007, S Ellison wrote: I was wondering if there are any R functions that give the tail area of a sum of chisquare distributions of the type: a_1 X_1 + a_2 X_2 where a_1 and a_2 are constants and X_1 and X_2 are independent chi-square variables with different degrees of freedom. You might also check out Welch and Satterthwaite's (separate) papers on effective degrees of freedom for compound estimates of variance, which led to a thing called the Welch-Satterthwaite equation by one (more or less notorious, but widely used) document called the ISO Guide to Expression of Uncertainty in Measurement (ISO, 1995). The original papers are B. L. Welch, J. Royal Stat. Soc. Suppl. (1936) 3 29-48; B. L. Welch, Biometrika (1938) 29 350-362; B. L. Welch, Biometrika (1947) 34 28-35; F. E. Satterthwaite, Psychometrika (1941) 6 309-316; F. E. Satterthwaite, Biometrics Bulletin (1946) 2 part 6 110-114. The W-S equation - which I believe is a special case of Welch's somewhat more general treatment - says that if you have multiple independent estimated variances v[i] (could be more or less equivalent to your a_i X_i?) 
with degrees of freedom nu[i], the distribution of their sum is approximately a scaled chi-squared distribution with effective degrees of freedom nu.effective given by

nu.effective = sum(v[i])^2 / sum((v[i]^2)/nu[i])

If I recall correctly, with an observed variance s^2 (corresponding to the sum(v[i]) above if those are observed variances), nu*(s^2/sigma^2) is distributed as chi-squared with degrees of freedom nu, so the scaling factor for quantiles would come out of there (depending whether you're after the tail areas for s^2 given sigma^2 or for a confidence interval for sigma^2 given s^2). However, I will be most interested to see what a more exact calculation provides! Steve Ellison
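The Welch-Satterthwaite recipe above can be applied directly to a weighted sum a_1 X_1 + a_2 X_2 of chi-squareds: match the first two moments of the sum to a scaled chi-squared c * chisq(f). The function name below is invented, and the comparison against simulation is only a sanity check, not the exact inversion Thomas mentions:

```r
## Satterthwaite: sum(a_i X_i), X_i ~ chisq(nu_i), approximated by c*chisq(f)
## matching mean  sum(a*nu) = c*f  and variance  2*sum(a^2*nu) = 2*c^2*f
satt.tail <- function(q, a, nu) {
  scale <- sum(a^2 * nu) / sum(a * nu)      # c
  f <- sum(a * nu)^2 / sum(a^2 * nu)        # effective df
  pchisq(q / scale, df = f, lower.tail = FALSE)
}

## check the right tail of 2*chisq(3) + 0.5*chisq(5) against simulation
set.seed(1)
sim <- 2 * rchisq(1e5, df = 3) + 0.5 * rchisq(1e5, df = 5)
mean(sim > 15)                                # simulated tail area
satt.tail(15, a = c(2, 0.5), nu = c(3, 5))    # approximation
```

With a single term (a = 1) the approximation reduces to the exact chi-squared tail, which makes a convenient unit test.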
Re: [R] what is the difference between survival analysis and (...)
On Wed, 28 Mar 2007, Frank E Harrell Jr wrote: Eric Elguero wrote: Hi everybody, recently I had to teach a course on the Cox model, of which I am not a specialist, to an audience of medical epidemiologists. Not a good idea you might say.. anyway, someone in the audience was very hostile. At some point, he said that the Cox model was useless, since all you have to do is count who dies and who survives, divide by the sample sizes and compute a relative risk, and if there was significant censoring, use cumulated follow-up instead of sample sizes and that's it! I began arguing that in the Cox model you could introduce several variables, interactions, etc, then I remembered logistic models ;-) The only (and poor) argument I could think of was that if Mr Cox took pains to devise his model, there should be some reason... That is a very ignorant person, concerning statistical efficiency/power/precision and how to handle incomplete follow-up (variable follow-up duration). There are papers in the literature (I wish I had them at my fingertips) that go into the efficiency loss of just counting events. If the events are very rare, knowing the time doesn't help as much, but the Cox model still can handle censoring correctly and that person's approach doesn't. Certainly just counting the events is inefficient -- the simplest example would be studies of some advanced cancers where nearly everyone dies during followup, so that there is little or no censoring but simple counts are completely uninformative. It's relatively hard to come up with an example where using the total time on test (rather than sample size) as a denominator is much worse than the Cox model, though. You need the baseline hazard to vary a lot over time and the censoring patterns to be quite different in the groups, but proportional hazards to still hold. 
I think the advantages of the Cox model over a reasonably sensible person-time analysis are real, but not dramatic -- it would be hard to find a data set that would convince the sort of person who would make that sort of claim. I would argue that computational convenience on the one hand, and the ability to exercise lots of nice mathematical tools on the other hand, have also contributed to the continuing popularity of the Cox model. -thomas
Re: [R] Replacement in an expression - can't use parse()
On Tue, 27 Mar 2007, Peter Dalgaard wrote: The way around this is to add a further layer of substitute() to insert the value of e:

eval(substitute(substitute(call, list(u2=quote(x), u3=1)), list(call=e[[1]])))
u1 + x + 1

Or

eval(do.call(substitute, list(e[[1]], list(u2=quote(x), u3=1))))

-thomas
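With a concrete e (invented here for illustration), the double-substitute idiom from Peter's post gives:

```r
e <- expression(u1 + u2 + u3)

## the outer substitute() splices the value of e[[1]] into the inner call;
## eval() then runs substitute(u1 + u2 + u3, list(u2 = quote(x), u3 = 1)),
## returning the new (still unevaluated) call
eval(substitute(substitute(call, list(u2 = quote(x), u3 = 1)),
                list(call = e[[1]])))
# u1 + x + 1
```

The result is a language object, not a value: u1 need not exist for the substitution to work.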
Re: [R] subset
On Mon, 26 Mar 2007, Marc Schwartz wrote: Sergio, Please be sure to cc: the list (ie. Reply to All) with follow-up questions. In this case, you would use %in% with a negation:

NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (!var3 %in% 2:3))

Probably a typo: should be !(var3 %in% 2:3) rather than (!var %in% 2:3) -thomas

See ?"%in%" for more information. HTH, Marc

On Mon, 2007-03-26 at 17:30 +0200, Sergio Della Franca wrote: Ok, this run correctly. Another question for you: I want to put more than one condition for var3, i.e.: I like to create a subset when: - var1=0 - var2=0 - var3 is different from 2 and from 3. Like you suggested, i perform this code:

NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (var3 != 2) & (var3 != 3))

There is a method to combine (var3 != 2) & (var3 != 3) in one condition? Thank you. Sergio

2007/3/26, Marc Schwartz [EMAIL PROTECTED]: On Mon, 2007-03-26 at 17:02 +0200, Sergio Della Franca wrote: Dear R-Helpers, I want to make a subset from my data set. I'd like to perform different conditions for the subset. I.e.: I like to create a subset when: - var1=0 - var2=0 - var3 is different from 2. How can i develop a subset under this condition? Thank you in advance. Sergio Della Franca. See ?subset Something along the lines of the following should work:

NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (var3 != 0))

HTH, Marc Schwartz
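A toy data frame (invented here) shows the combined %in% condition at work:

```r
DF <- data.frame(var1 = c(0, 0, 1, 0),
                 var2 = c(0, 0, 0, 0),
                 var3 = 1:4)

## keep rows with var1 == 0, var2 == 0, and var3 not equal to 2 or 3
NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & !(var3 %in% 2:3))
NewDF   # the rows with var3 = 1 and var3 = 4
```

Note that !var3 %in% 2:3 and !(var3 %in% 2:3) happen to parse the same way, because %in% binds more tightly than !; the parenthesised form just makes the intent explicit.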
Re: [R] Hardware for a new Workstation for best performance using R
On Thu, 15 Mar 2007, Andrew Perrin wrote: (in part) 2.) Yes, by all means you should use linux instead of windows. The graphics output is completely compatible with whatever applications you want to paste them into on Windows. This turns out not to be the case. It is not trivial to produce good graphics off Windows for adding to Microsoft Office documents (regrettably an important case for many people). There has been much discussion of this on the R-sig-mac mailing list, for example, where PNG bitmaps (at sufficiently high resolution) seem to be the preferred method. -thomas
Re: [R] Hardware for a new Workstation for best performance using R
On Mon, 19 Mar 2007, Gabor Grothendieck wrote: On 3/19/07, Thomas Lumley [EMAIL PROTECTED] wrote: On Thu, 15 Mar 2007, Andrew Perrin wrote: (in part) 2.) Yes, by all means you should use linux instead of windows. The graphics output is completely compatible with whatever applications you want to paste them into on Windows. This turns out not to be the case. It is not trivial to produce good graphics off Windows for adding to Microsoft Office documents (regrettably an important case for many people). There has been much discussion of this on the R-sig-mac mailing list, for example, where PNG bitmaps (at sufficiently high resolution) seem to be the preferred method. On Windows one can produce metafile output directly from R. Yes, indeed. However, this fact is of limited help when working on another operating system, which was the focus of the original question. -thomas
Re: [R] Mac vs. PC
On Fri, 9 Mar 2007, Richard Morey wrote: 1. R on Mac was compiled with optimizations for the CPU, while R for Windows was not. I could test this by compiling R with the Intel compiler, or GCC with optimizations, and seeing if I get a significant speed boost. Yes. The Mac distribution uses Apple's linear algebra library, which is based on ATLAS and uses both cores. The default Windows distribution doesn't use an optimized linear algebra library because there isn't one built in to Windows. You can use ATLAS with the Windows distribution and there are even precompiled DLLs around somewhere. 2. His R is 64 bit, while mine is for 32 bit windows. (I'm not sure how much of a difference that makes, or whether OSX is 64 bit.) No. His R isn't 64-bit. It would probably be slower if it were. The main reason to want 64-bit R is to use lots of memory rather than to be fast. 3. Data is getting swapped to the hard drive, and my hard drive is slower than his. I chose a slower hard drive to get bigger capacity for the price. This could be true in principle, but I don't think the matrices are large enough for it to be the main factor. His computer won't be twice as fast on most R tasks (though it will still be twice as pretty, of course). -thomas
Re: [R] GLM: order of terms in model
This is FAQ 7.18: Why does the output from anova() depend on the order of factors in the model? -thomas

On Fri, 9 Mar 2007, Christian Landry wrote: Dear R-helpers, I have been analysing data using a GLM. My model is as follows:

mod <- glm(V ~ T + as.factor(A) + N, family=gaussian)

and using anova(mod, test="F") to get the analysis of deviance table and the fraction of deviance explained by each term. T and A dominate with respect to their deviance, with T having a larger effect than A (about twice). However, if I reverse T and A in the model, I get that A now explains more deviance than T. My questions are: 1) What is it due to? 2) Is there any way around this? How do I find which model is best and/or can I use another method that won't be sensitive to the order of the terms. Thanks, Christian Reply to: [EMAIL PROTECTED]
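The order dependence of anova(), and an order-independent alternative in drop1(), can be seen on simulated data. The variable names follow the question, but the data below are invented:

```r
set.seed(42)
n <- 60
T <- rnorm(n)
A <- sample(1:3, n, replace = TRUE)
V <- T + 0.5 * (A == 2) + rnorm(n)

## anova() gives sequential (Type I) deviances: each term is adjusted
## only for the terms listed before it, so the order matters
m1 <- glm(V ~ T + as.factor(A), family = gaussian)
m2 <- glm(V ~ as.factor(A) + T, family = gaussian)
anova(m1, test = "F")
anova(m2, test = "F")

## drop1() tests each term adjusted for all the others,
## so its result does not depend on the order of the formula
drop1(m1, test = "F")
drop1(m2, test = "F")
```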
Re: [R] Memory error
On Thu, 8 Mar 2007, Andrew Perrin wrote: Greetings- Running R 2.4.0 under Debian Linux, I am getting a memory error trying to read a very large file:

library(foreign)
oldgrades.df <- read.spss('Individual grades with AI (Nov 7 2006).sav', to.data.frame=TRUE)
Error: cannot allocate vector of size 10826 Kb

Your file on disk seems to be about 300Mb, and it might well be larger in R, so it's probably too big for 32-bit R. However, you could try to.data.frame=FALSE in the read.spss() call. Based on memory profiling of the fairly similar read.dta() function I would guess that as.data.frame.list() might well be the culprit. -thomas
Re: [R] Mitools and lmer
Doug, It's mitools, not mltools. I wrote it. I think the problem is just that coef() is not the right function for getting the fixed effects. Beth wants

betas <- MIextract(model0, fun=fixef)

-thomas

On Sat, 3 Mar 2007, Douglas Bates wrote: On 3/2/07, Beth Gifford [EMAIL PROTECTED] wrote: Hey there I am estimating a multilevel model using lmer. I have 5 imputed datasets so I am using mitools to pool the estimates from the 5 datasets. Everything seems to work until I try to use MIcombine to produce pooled estimates. Does anyone have any suggestions? The betas and the standard errors were extracted with no problem so everything seems to work smoothly up until that point. I'm not familiar with the mltools package and I didn't see it listed in the CRAN packages. Can you provide a reference or a link to the package? Program:

# Read data
data.dir <- system.file("dta", package="mitools")
files.imp <- imputationList(lapply(list.files(data.dir, pattern="imp.\\.dta", full=TRUE), read.dta))
# estimate model over each imputed dataset
model0 <- with(files.imp, lmer(erq2tnc ~ 1 + trt2 + nash + wash + male + coh2 + coh3 + (1 | sitebeth)))
# extract betas and standard errors
betas <- MIextract(model0, fun=coef)
vars <- MIextract(model0, fun=vcov)
# Combine the results
summary(MIcombine(betas, vars))
Error in cbar + results[[i]] : non-numeric argument to binary operator
Error in summary(MIcombine(betas, vars)) : error in evaluating the argument 'object' in selecting a method for function 'summary'

First use traceback() to discover where the (first) error occurred. My guess is that MIcombine expects a particular type of object for the vars argument and it is not getting that type (and not checking for the correct type). Thanks Beth
Re: [R] Timings of function execution in R [was Re: R in Industry]
On 2/9/07, Prof Brian Ripley [EMAIL PROTECTED] wrote: The other reason why pmin/pmax are preferable to your functions is that they are fully generic. It is not easy to write C code which takes into account that <, [, [<- and is.na are all generic. That is not to say that it is not worth having faster restricted alternatives, as indeed we do with rep.int and seq.int. Anything that uses arithmetic is making strong assumptions about the inputs. It ought to be possible to write a fast C version that worked for atomic vectors (logical, integer, real and character), but is there any evidence of profiled real problems where speed is an issue? I had an example just last month of an MCMC calculation where profiling showed that pmax(x,0) was taking about 30% of the total time. I used

function(x) { z <- x < 0; x[z] <- 0; x }

which was significantly faster. I didn't try the arithmetic solution. Also, I didn't check if a solution like this would still be faster when both arguments are vectors (but there was a recent mailing list thread where someone else did). -thomas
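The replacement Thomas describes can be timed against pmax() directly (the function name is invented; timings will vary by machine and R version):

```r
x <- rnorm(1e6)

## restricted alternative to pmax(x, 0): assumes a plain numeric
## vector with no NAs, unlike the fully generic pmax()
clip0 <- function(x) { z <- x < 0; x[z] <- 0; x }

system.time(for (i in 1:50) pmax(x, 0))
system.time(for (i in 1:50) clip0(x))

## same answer either way
all.equal(clip0(x), pmax(x, 0))
```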
Re: [R] R in Industry
On Tue, 6 Feb 2007, Muenchen, Robert A (Bob) wrote: That sounds like a good idea. The name R makes it especially hard to find job postings, resumes or do any other type of search. Googling resume+sas or job opening+sas is quick and fairly effective (less a few airline jobs). Doing that with R is of course futile. At the risk of getting flamed, it's too bad it's not called something more unique such as Rpackage, Rlanguage, etc. For all sorts of reasons I don't think Googling for jobs using R was high on Ross and Robert's list of use cases when they chose the name ... It might be better to have an archived list rather than a CRAN page -- I've just noticed that cran.us last updated on Jan 12, which would be a long delay for job ads. -thomas
Re: [R] Two ways to deal with age in Cox model
On Mon, 5 Feb 2007, John Sorkin wrote: When running a Cox proportional hazards model, there are two ways to deal with age: including age as a covariate, or including age as part of the follow-up time, viz, snip I would appreciate any thoughts about the differences in the interpretation of the two models. One obvious difference is that in the first model (fitagecovariate) one can make inferences about age and in the second one cannot. I think a second difference may be that in the first model the riskfactor is assumed to have values measured at the values of age whereas in the second model riskfactor is assumed to have given values throughout the subject's life. There are even more possibilities (a nice example and discussion is in Breslow and Day, the example being occupational exposure to nickel and later cancer). The Cox model works by comparing covariates for the observation that failed and other observations at risk at the same time, so the comparisons are entirely within time-point. If you use time since start of study, you are comparing people with different covariates at the same time since start of study. If you use calendar time, you are comparing people with different covariates at the same calendar time. If you use age, you are comparing people with different covariates at the same age. In an observational study it often is more important to control for age or for calendar time than for time since the study started, so these might be better time scales. A disadvantage in some studies with longitudinal data is that on the study time scale everyone may have measurements at the same time, but on other time scales everyone may have measurements at different times. -thomas
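The two time scales correspond to two Surv() specifications in the survival package. The data below are simulated purely for illustration (all variable names invented), using the counting-process form of Surv() for the age scale:

```r
library(survival)

set.seed(1)
n <- 200
d <- data.frame(entry.age  = runif(n, 40, 60),
                riskfactor = rbinom(n, 1, 0.5))
d$futime   <- rexp(n, rate = 0.1)
d$status   <- rbinom(n, 1, 0.7)
d$exit.age <- d$entry.age + d$futime

## time-on-study as the time scale, age at entry as a covariate
fit.study <- coxph(Surv(futime, status) ~ entry.age + riskfactor, data = d)

## age as the time scale (left truncation at entry age):
## risk sets now compare people who are the same age
fit.age <- coxph(Surv(entry.age, exit.age, status) ~ riskfactor, data = d)
```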
Re: [R] read.spss and encodings
Thanks for the example file. I'm trying to work out what, if anything, is easy to support reliably for encodings in SPSS. -thomas

On Sun, 4 Feb 2007, I. Soumpasis wrote: Hi! This mail is related to Thomas' mail so I follow up. I use the Greek language, and SPSS files with value labels containing Greek characters cannot be imported with read.spss. I am on:

sessionInfo()
R version 2.5.0 Under development (unstable) (2007-02-01 r40632) i686-pc-linux-gnu
locale: LC_CTYPE=el_GR.UTF-8;LC_NUMERIC=C;LC_TIME=el_GR.UTF-8;LC_COLLATE=el_GR.UTF-8;LC_MONETARY=el_GR.UTF-8;LC_MESSAGES=el_GR.UTF-8;LC_PAPER=el_GR.UTF-8;LC_NAME=el_GR.UTF-8;LC_ADDRESS=el_GR.UTF-8;LC_TELEPHONE=el_GR.UTF-8;LC_MEASUREMENT=el_GR.UTF-8;LC_IDENTIFICATION=el_GR.UTF-8

The following files are small examples used below:
http://users.forthnet.gr/the/isoumpasis/data/1.sav
http://users.forthnet.gr/the/isoumpasis/data/12.sav
http://users.forthnet.gr/the/isoumpasis/data/12.RData

The first file has English variable names and Greek value labels and can be read:

read.spss("~/Desktop/1.sav")
$VAR1
 [1] "\xf3\xf0\xdf\xf4\xe9"         "\xf3\xf0\xdf\xf4\xe9"
 [3] "\xf3\xf0\xdf\xf4\xe9"         "\xf3\xf0\xdf\xf4\xe9"
 [5] "\xf3\xf0\xdf\xf4\xe9"         "\xe3\xf1\xe1\xf6\xe5\xdf\xef"
 [7] "\xe3\xf1\xe1\xf6\xe5\xdf\xef" "\xe3\xf1\xe1\xf6\xe5\xdf\xef"
 [9] "\xe3\xf1\xe1\xf6\xe5\xdf\xef" "\xf3\xf0\xdf\xf4\xe9"
[11] "\xe3\xf1\xe1\xf6\xe5\xdf\xef"

$VAR2
 [1] 5 6 7 7 5 7 3 5 6 7 8

attr(,"label.table")
attr(,"label.table")$VAR1
NULL
attr(,"label.table")$VAR2
NULL

I can then convert the characters to Greek using Thomas' code, so there is no problem here. In file 12.sav the value labels are Greek. The problem is that the file cannot be read:

read.spss("~/Desktop/12.sav")
Error in read.spss("~/Desktop/12.sav") : error reading system-file header
In addition: Warning message:
~/Desktop/12.sav: position 0: Variable name begins with invalid character

I also tried using use.value.labels=FALSE, getting the same message. 
read.spss("~/Desktop/12.sav", use.value.labels=FALSE)
Error in read.spss("~/Desktop/12.sav", use.value.labels = FALSE) : error reading system-file header
In addition: Warning message:
~/Desktop/12.sav: position 0: Variable name begins with invalid character

The encoding of the SPSS files is windows-1253 (Greek). The problem should be with other non-ASCII characters too. Is there any workaround for this? Thanks in advance I. Soumpasis
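For the file that does read, one workaround is to re-encode the returned label strings by hand with iconv(). The "CP1253" source encoding here is an assumption based on the poster's statement that the files are windows-1253:

```r
## raw bytes as returned by read.spss(), e.g. the first value label above
x <- "\xf3\xf0\xdf\xf4\xe9"

## reinterpret the bytes as windows-1253 (Greek) and convert to UTF-8;
## "CP1253" is an assumption about how the file was written
iconv(x, from = "CP1253", to = "UTF-8")
```

The same call can be applied with sapply() over every character column and over the "label.table" attribute of the returned object.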
Re: [R] Help with efficient double sum of max (X_i, Y_i) (X Y vectors)
On Thu, 1 Feb 2007, Ravi Varadhan wrote: Jeff, Here is something which is a little faster:

sum1 <- sum(outer(x, x, FUN=pmax))
sum3 <- sum(outer(x, y, FUN=pmax))

This is the sort of problem where profiling can be useful. My experience with pmax() is that it is surprisingly slow, presumably because it handles recycling and NAs. In the example I profiled (an MCMC calculation) it was measurably faster to use

function(x, y) { i <- x < y; x[i] <- y[i]; x }

-thomas

Best, Ravi. --- Ravi Varadhan, Ph.D. Assistant Professor, The Center on Aging and Health, Division of Geriatric Medicine and Gerontology, Johns Hopkins University Ph: (410) 502-2619 Fax: (410) 614-9625 Email: [EMAIL PROTECTED] Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jeffrey Racine Sent: Thursday, February 01, 2007 1:18 PM To: r-help@stat.math.ethz.ch Subject: [R] Help with efficient double sum of max (X_i, Y_i) (X Y vectors)

Greetings. For R gurus this may be a no-brainer, but I could not find pointers to efficient computation of this beast in past help files. Background - I wish to implement a Cramer-von Mises type test statistic which involves double sums of max(X_i,Y_j) where X and Y are vectors of differing length. I am currently using ifelse pointwise in a vector, but have a nagging suspicion that there is a more efficient way to do this. Basically, I require three sums:

sum1: \sum_i\sum_j max(X_i,X_j)
sum2: \sum_i\sum_j max(Y_i,Y_j)
sum3: \sum_i\sum_j max(X_i,Y_j)

Here is my current implementation - any pointers to more efficient computation greatly appreciated.

nx <- length(x)
ny <- length(y)
sum1 <- 0
sum3 <- 0
for(i in 1:nx) {
  sum1 <- sum1 + sum(ifelse(x[i] > x, x[i], x))
  sum3 <- sum3 + sum(ifelse(x[i] > y, x[i], y))
}
sum2 <- 0
sum4 <- sum3 # symmetric and identical
for(i in 1:ny) {
  sum2 <- sum2 + sum(ifelse(y[i] > y, y[i], y))
}

Thanks in advance for your help. -- Jeff -- Professor J. S. 
Racine Phone: (905) 525 9140 x 23825 Department of Economics FAX: (905) 521-8232 McMaster University e-mail: [EMAIL PROTECTED] 1280 Main St. W., Hamilton, Ontario, Canada L8S 4M4 URL: http://www.economics.mcmaster.ca/racine/ `The generation of random numbers is too important to be left to chance'
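The outer()/pmax() version can be checked against the explicit loop on small vectors:

```r
set.seed(1)
x <- rnorm(5)
y <- rnorm(7)

## vectorised double sum: outer() builds the nx-by-ny matrix of
## pairwise maxima, and sum() adds it up
sum3 <- sum(outer(x, y, FUN = pmax))

## explicit loop, as in the original post
s <- 0
for (i in seq_along(x)) s <- s + sum(ifelse(x[i] > y, x[i], y))

all.equal(s, sum3)   # TRUE
```

The outer() version trades memory (an nx-by-ny matrix) for speed, which is fine here but worth remembering for very long vectors.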
Re: [R] read.spss and encodings
On Thu, 1 Feb 2007, Thomas Friedrichsmeier wrote: Hi! I'm having trouble with importing SPSS files containing non-ASCII characters. Peter has explained what is going on. It would be ideal for read.spss() to do the translation to the current locale. This would require knowing what encoding the SPSS file is using. I think it is always a one-byte encoding, and in your case it is apparently Latin-1, but I don't know if this is always the case, or how to tell which encoding it uses. -thomas
Re: [R] How to get correct integration in C for step function?
On Sun, 21 Jan 2007, Lynette wrote: Dear all, I am using Rdqags in C to realize the integration. It seems for continuous C functions I can get correct results. However, for step functions, the results are not correct. For example, the following one, when integrated from 0 to 1, gives 1 instead of the correct 1.5. Using integrate() in R for an R-defined step function gives the right answer (eg on the example in ?ecdf). This suggests a problem in your C code, since integrate() just calls dqags. -thomas
Re: [R] [ESS] How to get correct integration in C for step function?
On Mon, 22 Jan 2007, Lynette wrote: Well, I have no idea either. I can get correct answers for continuous functions but incorrect for step functions. I have just tried using Rdqags from C for the function x<0 and it worked fine (once I had declared all the arguments correctly). The code is below. -thomas

    #include <Rinternals.h>
    #include <R_ext/Applic.h>

    double stepfn(double x){ return (x < 0); }

    SEXP call_stepfn(SEXP x){
      SEXP answer;
      int i, n;
      n = length(x);
      PROTECT(answer = allocVector(REALSXP, n));
      for(i = 0; i < n; i++){
        REAL(answer)[i] = stepfn(REAL(x)[i]);
      }
      UNPROTECT(1);
      return answer;
    }

    void vec_stepfn(double *x, int n, void *ex){
      int i;
      for (i = 0; i < n; i++) x[i] = stepfn(x[i]);
      return;
    }

    void Cvec_stepfn(double *x, double *n){
      vec_stepfn(x, *n, (void *) NULL);
      return;
    }

    SEXP int_stepfn(SEXP lower, SEXP upper){
      SEXP answer;
      double result, abserr;
      int last, neval, ier;
      int lenw;
      int *iwork;
      double *work;
      int limit = 100;
      double reltol = 0.1;
      double abstol = 0.1;
      lenw = 4 * limit;
      iwork = (int *) R_alloc(limit, sizeof(int));
      work = (double *) R_alloc(lenw, sizeof(double));
      Rdqags(vec_stepfn, (void *) NULL, REAL(lower), REAL(upper),
             &abstol, &reltol, &result, &abserr, &neval, &ier,
             &limit, &lenw, &last, iwork, work);
      printf("%f %f %d\n", result, abserr, ier);
      PROTECT(answer = allocVector(REALSXP, 1));
      REAL(answer)[0] = result;
      UNPROTECT(1);
      return answer;
    }
Re: [R] How to get correct integration in C for step function?
On Mon, 22 Jan 2007, Lynette wrote: Dear all, especially to Thomas, I have figured out the problem. For the step function, something was wrong with my C code. I should use the expression ((x>=0.25)&&(x<=0.75)) ? 2:1 instead of ((x>=1/4)&&(x<=3/4)) ? 2:1. I have no idea why 0.25 makes a difference from 1/4 in C, but now I can go ahead with the correct integration in C. Thank you all. And hope this helps others. 1 and 4 are ints in C, so 1/4 is 0 (integer division). -thomas
Re: [R] integrate and quadratic forms
As the documentation for integrate() says, the function must be vectorized:

    f: an R function taking a numeric first argument and returning a
       numeric vector of the same length.

so you can't use sum(). You need matrix operations or an explicit loop to add up the terms. -thomas

On Thu, 18 Jan 2007, rdporto1 wrote: Hi all. I'm trying to numerically invert the characteristic function of a quadratic form following Imhof's (1961, Biometrika 48) procedure. The parameters are:

    lambda=c(.6,.3,.1)
    h=c(2,2,2)
    sigma=c(0,0,0)
    q=3

I've implemented Imhof's procedure two ways that, for me, should give the same answer:

    # more legible
    integral1 = function(u) {
      o = (1/2)*sum(h*atan(lambda*u)+sigma^2*lambda*u/(1+lambda^2*u^2)) - q*u/2
      rho = prod((1+lambda^2*u^2)^(h/4)) *
            exp( (1/2)*sum((sigma*lambda*u)^2/(1+lambda^2*u^2)) )
      integrand = sin(o)/(u*rho)
    }

    # same as above
    integral2 = function(u) {
      sin((1/2)*sum(h*atan(lambda*u)+sigma^2*lambda*u/(1+lambda^2*u^2)) - q*u/2) /
        (u*(prod((1+lambda^2*u^2)^(h/4)) *
            exp( (1/2)*sum((sigma*lambda*u)^2/(1+lambda^2*u^2)) )))
    }

The following should be near 0.18. However, neither answer is near this value, nor do they agree with each other!

    1/2+(1/pi)*integrate(integral1,0,Inf)$value
    [1] 1.022537
    1/2+(1/pi)*integrate(integral2,0,Inf)$value
    [1] 1.442720

What's happening? Is this a bug or OS specific? Shouldn't they give the same answer? Why do I get results so different from 0.18? In time: the procedure works fine for q=.2. I'm running R 2.4.1 on a PC with Windows XP 32 bits. Other ways (in R) to find the distribution of general quadratic forms are welcome. Thanks in advance. Rogerio.
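To make Thomas's fix concrete, here is one way to wrap the scalar integrand for integrate() with sapply() (an untested sketch; the parameter values and formula are copied from the post, the function names are illustrative):

```r
lambda <- c(.6, .3, .1); h <- c(2, 2, 2); sigma <- c(0, 0, 0); q <- 3

## scalar version: sum() and prod() collapse over the parameter vectors,
## which is why it must be called one abscissa at a time
imhof1 <- function(u) {
  o   <- (1/2) * sum(h * atan(lambda * u) +
                     sigma^2 * lambda * u / (1 + lambda^2 * u^2)) - q * u / 2
  rho <- prod((1 + lambda^2 * u^2)^(h/4)) *
         exp((1/2) * sum((sigma * lambda * u)^2 / (1 + lambda^2 * u^2)))
  sin(o) / (u * rho)
}

## integrate() passes a whole vector of points, so evaluate pointwise
imhof.vec <- function(u) sapply(u, imhof1)
1/2 + (1/pi) * integrate(imhof.vec, 0, Inf)$value
```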
Re: [R] Is it the PPS samples i needed in R?
On Fri, 12 Jan 2007, zhijie zhang wrote: Dear friends, I want to do an unequal probability sampling, that is, probability proportionate to size. Is the following program right? Say my original dataset is:

    ID Population
    1  100
    2  200
    3  300

If the population is large, then the corresponding ID has a large probability of being selected.

    sample(A$ID, size=2, replace = FALSE, prob = A$population) # suppose the dataset name is A

Is it the PPS sample I need? No, this does not give PPS samples for sample size > 1. The pps and sampling packages have code for PPS samples. -thomas
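For the archives, one possible route via the sampling package Thomas mentions (a sketch, assuming the package is installed; the function names are from that package and the data frame mirrors the poster's example):

```r
library(sampling)
A <- data.frame(ID = 1:3, population = c(100, 200, 300))

## inclusion probabilities proportional to size, for a sample of 2
pik <- inclusionprobabilities(A$population, 2)

## systematic PPS sampling without replacement; returns a 0/1 indicator
s <- UPsystematic(pik)
A$ID[s == 1]
```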
Re: [R] SAS and R code hazard ratios
On Wed, 10 Jan 2007, [EMAIL PROTECTED] wrote: I am new to R and have been comparing CPH survival analysis hazard ratios between R and SAS PHREG. The binary covariates' HRs are the same; however the continuous variables, for example age, have quite different HRs, although in the same direction. SAS PHREG produces HRs which are the change in risk for every one-increment change in the independent variable. How do I interpret the HRs produced by R? Thanks much, C

What function did you use to fit the model in R? If you used coxph() in the survival package then you should get the same answers as SAS. If you used cph() in the Design package then the output says what the increment is that corresponds to the quoted hazard ratio, and the default is the interquartile range. -thomas
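A minimal comparison in the survival package, using the bundled lung data as a stand-in for the poster's dataset (illustrative only):

```r
library(survival)

## coxph() reports log-hazard coefficients per one-unit increment in each
## covariate, so exponentiating matches SAS PROC PHREG's convention
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
exp(coef(fit))  # hazard ratios per one-unit change
```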
Re: [R] memory problem
On Sat, 6 Jan 2007, Zoltan Kmetty wrote: Hi! I had some memory problems with R - I hope somebody can tell me a solution. I work with very large datasets, but R cannot allocate enough memory to handle these datasets. You haven't said what you want to do with these datasets. -thomas
Re: [R] eval(parse(text vs. get when accessing a function
On Fri, 5 Jan 2007, Ramon Diaz-Uriarte wrote: I see, this is a direct way of dealing with the problem. However, you first need to build the f list, and you might not know about that ahead of time. For instance, suppose I build a function so that the only thing that you need to do to use my function g is to call your function f.something, and then pass the something. I am still under the impression that, given your answer, using eval(parse(text=...)) is not your preferred way. What are the possible problems (if there are any, that is)? I guess I am puzzled by the advice to rethink whether that was really the right question.

There are definitely situations where parse() is necessary or convenient, or we wouldn't provide it. For example, there are some formula-manipulation problems where it really does seem to be the best solution. The point of my observation was that it is relatively common for people to ask about parse() solutions to problems, but relatively rare to see them in code by experienced R programmers. The 'rethink the question' point is that a narrowly-posed programming problem may suggest parse() as the answer, when thinking more broadly about what you are trying to do may allow a completely different approach [the example of lists is a common one]. The problem with eval(parse()) is not primarily one of speed. A problem with parse() is that manipulating text strings is easy to mess up, since text has so much less structure than code. A problem with eval() is that it is too powerful -- since it can do anything, it is harder to keep track of what it is doing. In one sense this is just a style issue, but I still think my comment is good advice. If you find yourself wanting to use parse() it is a good idea to stop and think about whether there is a better way to do it. Often, there is. Sometimes, there isn't. -thomas
Re: [R] understanding integer divide (%/%)
On Wed, 3 Jan 2007, ONKELINX, Thierry wrote: This is due to the internal representation of 0.1, which is not exactly 0.1 but very close to it. If you want to do an integer divide, you should only use integers to divide with.

This must be more-or-less correct, but it is worth noting that

    0.1*10==1
    [1] TRUE
    1/0.1==10
    [1] TRUE
    1%/%0.1==10
    [1] FALSE

so it isn't quite that simple. Interestingly, the results seem to vary by system -- on a G4 Mac I get 1 %/% (1/x) == x for all x from 1 to 50. -thomas

Cheers, Thierry ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance Gaverstraat 4 9500 Geraardsbergen Belgium tel. + 32 54/436 185 [EMAIL PROTECTED] www.inbo.be Do not put your faith in what statistics say until you have carefully considered what they do not say. ~William W. Watt A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. ~M.J. Moroney

-----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jeffrey Prisbrey Sent: Wednesday, 3 January 2007 14:21 To: r-help@stat.math.ethz.ch Subject: [R] understanding integer divide (%/%)

I am confused about why the following occurs:

    version
    platform       i386-pc-mingw32
    arch           i386
    os             mingw32
    system         i386, mingw32
    status
    major          2
    minor          4.0
    year           2006
    month          10
    day            03
    svn rev        39566
    language       R
    version.string R version 2.4.0 (2006-10-03)

    1 %/% 0.1
    [1] 9
    10 %/% 1
    [1] 10

This effect led me into a trap when I tried to classify a set of proportions based on the first decimal place by integer dividing by 0.1. Can someone explain why this behavior occurs and give me an insight into how to predict it? Thanks, -- Jeff
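A small illustration of why classifying by the first decimal place goes wrong with %/%, plus one defensive idiom (the %/% result can vary by platform, as Thomas notes; the variable p is just an example value):

```r
print(0.1 * 10 == 1)   # TRUE: here the rounding errors happen to cancel
print(1 %/% 0.1)       # often 9: the stored 0.1 is slightly larger than 1/10

## first decimal digit of p: round away the representation error first,
## then truncate, instead of integer-dividing by a non-representable 0.1
p <- 0.3
print(floor(round(p * 10, 10)))  # 3
```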
Re: [R] Any container in R?
On Mon, 1 Jan 2007, Feng Qiu wrote: Hi Duncan: Thank you very much! I checked out unique(), it does exactly what I want. But I'm still curious about whether R provides an STL (standard template library). No. Some things the STL does aren't needed in R, others are implemented differently, and others aren't implemented. One particularly important example is iterators, which will often either happen invisibly due to vectorized operations or will be done with the *apply family of functions. Your example could have been done either way. Using duplicated() is the vectorized approach; the apply approach would use tapply(). C++ is not terribly similar to R. A lot of the effort in the STL is expended on allowing a piece of code to be used on different types (where appropriate). In R you have to expend effort on stopping a piece of code being used on different types (where inappropriate). -thomas
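To illustrate the two idioms Thomas contrasts, applied to the duplicate-removal task (a sketch with toy data):

```r
x <- c(3, 1, 3, 2, 1)

## vectorized style: a logical mask does the whole pass at once
x[!duplicated(x)]     # distinct values in order of first appearance

## apply-family style: iterate over groups without an explicit loop
tapply(x, x, length)  # how many times each distinct value occurs
```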
Re: [R] LU bug in Matrix package
On Thu, 28 Dec 2006, yangguoyi.ou wrote: There is a bug in Matrix package, please check it, thanks!

You didn't say what R code you ran to get that output or why you think it is wrong. Let us experiment to see if we can guess what the problem is from the limited information provided.

    x <- t(Matrix(1:25, 5))
    x
    5 x 5 Matrix of class "dgeMatrix"
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    2    3    4    5
    [2,]    6    7    8    9   10
    [3,]   11   12   13   14   15
    [4,]   16   17   18   19   20
    [5,]   21   22   23   24   25
    lux <- lu(x)
    Warning message:
    Exact singularity detected during LU decomposition.

Now check that the decomposition is correct:

    with(expand(lux), P %*% L %*% U)
    5 x 5 Matrix of class "dgeMatrix"
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    2    3    4    5
    [2,]    6    7    8    9   10
    [3,]   11   12   13   14   15
    [4,]   16   17   18   19   20
    [5,]   21   22   23   24   25

Ok, it is correct. Now let's look at the matrices:

    expand(lux)
    $L
    5 x 5 Matrix of class "dtrMatrix"
         [,1]       [,2]   [,3]       [,4] [,5]
    [1,] 1.         .      .          .    .
    [2,] 0.04761905 1.     .          .    .
    [3,] 0.52380952 0.5000 1.         .    .
    [4,] 0.28571429 0.7500 0.35400130 1.   .
    [5,] 0.76190476 0.2500 0.5000     0.   1.

    $U
    5 x 5 Matrix of class "dtrMatrix"
         [,1]     [,2]          [,3]          [,4]          [,5]
    [1,] 2.10e+01 2.20e+01      2.30e+01      2.40e+01      2.50e+01
    [2,] .        9.523810e-01  1.904762e+00  2.857143e+00  3.809524e+00
    [3,] .        .            -1.919092e-15 -3.711332e-15 -5.614541e-15
    [4,] .        .             .             3.781500e-16  6.288326e-16
    [5,] .        .             .             .             0.00e+00

    $P
    5 x 5 sparse Matrix of class "pMatrix"
    [1,] . 1 . . .
    [2,] . . . 1 .
    [3,] . . 1 . .
    [4,] . . . . 1
    [5,] 1 . . . .

Perhaps the output was a trimmed version of this? The L and U matrices agree, but you didn't show the permutation matrix. Did you just miss the fact that the decomposition is P%*%L%*%U? If you aren't used to S4 classes it is easy to miss the fact that class?LU gives help on the structure of the LU class, and if you try to guess what the components are without the documentation it is easy to be confused. -thomas

Matlab result:

    x =
         1     2     3     4     5
         6     7     8     9    10
        11    12    13    14    15
        16    17    18    19    20
        21    22    23    24    25
    lu(x)
    ans =
       21.0000   22.0000   23.0000   24.0000   25.0000
        0.0476    0.9524    1.9048    2.8571    3.8095
        0.7619    0.2500   -0.0000   -0.0000   -0.0000
        0.5238    0.5000    0.4839         0    0.0000
        0.2857    0.7500   -0.0645         0    0.0000

Gsl result:

    21 22 23 24 25
    0.047619 0.952381 1.90476 2.85714 3.80952
    0.761905 0.25 -3.44169e-015 -6.88338e-015 -1.03251e-014
    0.52381 0.5 -0.0322581 -1.77636e-015 -1.66533e-015
    0.285714 0.75 -0.0645161 -0 2.22045e-016

R result:

    $L
    5 x 5 Matrix of class "dtrMatrix"
         [,1]       [,2]   [,3]       [,4] [,5]
    [1,] 1.         .      .          .    .
    [2,] 0.04761905 1.     .          .    .
    [3,] 0.52380952 0.5000 1.         .    .
    [4,] 0.28571429 0.7500 0.35400130 1.   .
    [5,] 0.76190476 0.2500 0.5000     0.   1.

    $U
    5 x 5 Matrix of class "dtrMatrix"
         [,1]     [,2]          [,3]          [,4]          [,5]
    [1,] 2.10e+01 2.20e+01      2.30e+01      2.40e+01      2.50e+01
    [2,] .        9.523810e-01  1.904762e+00  2.857143e+00  3.809524e+00
    [3,] .        .            -1.919092e-15 -3.711332e-15 -5.614541e-15
    [4,] .        .             .             3.781500e-16  6.288326e-16
    [5,] .        .             .             .             0.00e+00
Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]
On Thu, 21 Dec 2006, Iris Kolder wrote: Thank you all for your help! So with all your suggestions we will try to run it on a computer with a 64-bit processor. But I've been told that the new R versions all work on a 32-bit processor. I read in other posts that only the old R versions were capable of handling larger data sets and ran under 64-bit processors. I also read that they are adapting the new R version for 64-bit processors again, so does anyone know if there is a version available that we could use?

Huh? R 2.4.x runs perfectly happily accessing large memory under Linux on 64-bit processors (and Solaris, and probably others). I think it even works on Mac OS X now. For example:

    x <- rnorm(1e9)
    gc()
                 used   (Mb) gc trigger   (Mb)   max used   (Mb)
    Ncells     222881   12.0     467875   25.0         35   18.7
    Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3

-thomas

Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle
Re: [R] attach and object masking
On Tue, 19 Dec 2006, DEEPANKAR BASU wrote: Hi R users! I am new to R. When I try to attach a simple dataset using the attach() command, I get the following message:

    attach(data1)
    The following object(s) are masked from package:base :
        write

Can someone tell me what this means? ('write' is the name of a variable in the dataset.) And, do I need to do something about this?

It means that 'write' is the name of a variable in the dataset. R is warning you that you have two things called 'write' -- your variable and a function in the base package. It also means that you have missed at least three upgrades of R (the fourth is just out), since in version 2.3.0 and more recent you don't get the warning when a variable and a function have the same name, only for two variables or two functions. There have been quite a lot of other changes since your version of R, so it would be worth upgrading. -thomas
Re: [R] attach and object masking
On Tue, 19 Dec 2006, Thomas Lumley wrote: It also means that you have missed at least three upgrades of R (the fourth is just out) since in version 2.3.0 and more recent you don't get the warning when a variable and a function have the same name, only for two variables or two functions. There have been quite a lot of other changes since your version of R, so it would be worth upgrading. I just noticed I didn't say explicitly whether you need to do anything else about the warning. You don't. -thomas
Re: [R] Upgrading
On Tue, 19 Dec 2006, DEEPANKAR BASU wrote: Hi! As per Thomas' advice, I upgraded R by using update.packages() and got the following warning messages:

That was not my advice on how to upgrade. update.packages() updates the packages. You need to download a new version of R itself. You will then need to update or reinstall the packages. The warning messages are because you are updating to versions of the packages that do not run on your old version of R. -thomas

    Warning messages:
    1: installation of package 'lmtest' had non-zero exit status in: install.packages(update[, Package], instlib, contriburl = contriburl,
    2: installation of package 'quadprog' had non-zero exit status in: install.packages(update[, Package], instlib, contriburl = contriburl,
    3: installation of package 'cluster' had non-zero exit status in: install.packages(update[, Package], instlib, contriburl = contriburl,
    4: installation of package 'tseries' had non-zero exit status in: install.packages(update[, Package], instlib, contriburl = contriburl,

Do I need to worry about these messages? Do I need to do something else to complete the upgrade process? Another question: what is the command for renaming an existing variable? Thanks. Deepankar