Re: [R] residual plots for lmer in lme4 package
I am doubtful whether standard residual plots are very useful in this context. One wants the theoretical effects Ui to have a normal distribution. If there are similar amounts of information on each patient, it may not be too bad to extract the estimated effects and check them for normality. I don't think you can use residuals() to extract them, as glmer() does not have the notion of levels. Maybe they can be extracted using ranef(), but I do not see any examples of its use with glmer() on the help pages. The issue of checking for normality of effects in multi-level models has not been very much researched, as far as I can tell. The function residuals() gives residuals that adjust for all except the highest level of random effects. Depending on the relative magnitudes of the variance components, whether or not these residuals are anywhere near normal may not be of much or any consequence.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 17 Aug 2007, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Martin Henry H. Stevens [EMAIL PROTECTED]
Date: 17 August 2007 12:08:15 AM
To: Margaret Gardiner-Garden [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED] [EMAIL PROTECTED]
Subject: Re: [R] residual plots for lmer in lme4 package

Hi Margaret,
Have a look at qqmath in the lattice package. ?qqmath
Hank

On Aug 16, 2007, at 2:45 AM, Margaret Gardiner-Garden wrote:

Hi, I was wondering if I might be able to ask some advice about doing residual plots for the lmer function in the lme4 package. Our group's aim is to find if the expression staining of a particular gene in a sample (or core) is related to the pathology of the core. To do this, we used the lmer function to perform the logistic mixed model below. I apologise in advance for the lack of subscripts.

logit P(yij = 1) = beta0 + Ui + beta1*Pathol(ij), where Ui ~ N(0, sigma_u^2),

i indexes patient, j indexes measurement, Pathol is an indicator variable (0,1) for benign epithelium versus cancer, and yij is the staining indicator (0,1) for each core, where yij equals 1 if the core stains positive and 0 otherwise. (I have inserted some example R code at the end of this message.) I was wondering if you knew how I could test that the errors Ui are normally distributed in my fit. I am not familiar with how to do residual plots for a mixed logistic regression (or even for any logistic regression!). Any advice would be greatly appreciated! Thanks and Regards Marg
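A minimal sketch of the check described above, with an invented data frame 'cores' having columns y (0/1 staining), pathol and patient; ranef() extracts the estimated patient effects (strictly, their conditional modes):

library(lme4)
fit <- glmer(y ~ pathol + (1 | patient), family = binomial, data = cores)
## Extract the estimated patient effects and check them for normality
u <- ranef(fit)$patient[["(Intercept)"]]
qqnorm(u); qqline(u)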
Re: [R] LSD, HSD,...
<follow-on rant> Stepwise regression variable selection methods make multiple post hoc comparisons. The number of comparisons may be very large, vastly more than the half-dozen post hoc comparisons that are common in an experimental design context. There is a disconnect here. The multiple testing issue is noted in pretty much every discussion of the analysis of experimental data, but is not commonly mentioned (at least in older texts) in discussions of stepwise regression, best subsets and related regression approaches. One reason for this silence may be that there is no ready HSD-like fix. The SEs and t-statistics that lm() gives for the finally selected model can be grossly optimistic. Running the analysis with the same model matrix, but with y-values that are noise, can give a useful wake-up call. </rant>

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 16 Jul 2007, at 8:00 PM, Simon Blomberg wrote:

If you have a priori planned comparisons, you can just test those using linear contrasts, with no need to correct for multiple testing. If you do not, and you are relying on looking at the data and analysis to tell you which treatment means to compare, and you are considering several tests, then you should consider correcting for multiple testing. There is a large literature on the properties of the various tests. (Tukey HSD usually works pretty well for me.)

<rant> Why do people design experiments with a priori hypotheses in mind, yet test them using post hoc comparison procedures? It's as if they are afraid to admit that they had hypotheses to begin with! Far better to test what you had planned to test using the more powerful methods for planned comparisons, and leave it at that. </rant>

On Mon, 2007-07-16 at 09:52 +0200, Adrian J. Montero Calvo wrote:

Hi, I'm designing an experiment in order to compare the growth of several clones of a tree species. It will be a complete randomized block design. How can I decide which method of mean comparison to choose? LSD, HSD, Tukey HSD, Duncan, ...? Thanks in advance

--
Simon Blomberg, BSc (Hons), PhD, MAppStat.
Lecturer and Consultant Statistician
Faculty of Biological and Chemical Sciences
The University of Queensland
St. Lucia Queensland 4072 Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
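The wake-up call is easily staged. A sketch, with an invented 100 x 40 matrix of pure-noise regressors and a pure-noise y, so that nothing is truly significant:

set.seed(17)
X <- matrix(rnorm(100 * 40), nrow = 100)
colnames(X) <- paste("x", 1:40, sep = "")
dat <- data.frame(y = rnorm(100), X)              # y is unrelated to the x's
chosen <- step(lm(y ~ ., data = dat), trace = 0)  # stepwise selection
summary(chosen)  # expect several seemingly 'significant' t-statistics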
[R] Bad points in regression [Broadcast]
None of Andy's comments are inconsistent with the point that rlm() and lqs(), if they disagree with lm(), likely offer better places to start than does lm() in identifying points that should be examined as in some sense outliers. All such methods are to be used, not as crutches, but as sources of insight. Incidentally, the rlm class inherits from lm, and plot.lm() (or, preferably, the plot generic) can be used with rlm objects. This is not the case for lqs objects.

John Maindonald.

On 17 Mar 2007, at 10:00 PM, Andy Liaw wrote:

From: Liaw, Andy [EMAIL PROTECTED]
Date: 17 March 2007 4:21:53 AM
To: Bert Gunter [EMAIL PROTECTED], [EMAIL PROTECTED], r-help@stat.math.ethz.ch
Subject: Re: [R] Bad points in regression [Broadcast]

(My turn on the soapbox ...) I'd like to add a bit of caveat to Bert's view. I'd argue (perhaps even plead) that robust/resistant procedures be used with care. They should not be used as a shortcut to avoid careful analysis of data. I recall that in my first course on regression, the professor made it clear that we're fitting models to data, not the other way around. When the model fits badly to (some of) the data, do examine and think carefully about what happened. Verify that bad data are indeed bad, instead of using statistical criteria to make that judgment. A scientific colleague reminded me of this point when I tried to sell him the idea of robust/resistant methods: "Don't use these methods as a crutch to stand on badly run experiments (or poorly fitted models)." Cheers, Andy

From: Bert Gunter

(mount soapbox ...) While I know the prior discussion represents common practice, I would argue -- perhaps even plead -- that the modern (?? 30 years old now) alternative of robust/resistant estimation be used, especially in the readily available situation of least-squares regression. RSiteSearch("Robust") will bring up numerous possibilities. rrcov and robustbase are at least two packages devoted to this, but the functionality is available in many others (e.g. rlm() in MASS).

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404
650-467-7374

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ted Harding
Sent: Friday, March 16, 2007 6:44 AM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] Bad points in regression

On 16-Mar-07 12:41:50, Alberto Monteiro wrote:

Ted Harding wrote:

alpha <- 0.3
beta <- 0.4
sigma <- 0.5
err <- rnorm(100)
err[15] <- 5; err[25] <- -4; err[50] <- 10
x <- 1:100
y <- alpha + beta * x + sigma * err
ll <- lm(y ~ x)
plot(ll)

ll is the output of a linear model fitted by lm(), and so has several components (see ?lm in the section "Value"), one of which is residuals (which can be abbreviated to res). So, in the case of your example,

which(abs(ll$res) > 2)
15 25 50

extracts the information you want (and the 2 was inspired by looking at the residuals plot from your plot(ll)).

Ok, but how can I grab those points _in general_? What is the criterion that plot used to mark those points as bad points?

Ahh ...! I see what you're after. OK, look at the plot method for lm():

?plot.lm
## S3 method for class 'lm':
plot(x, which = 1:4,
     caption = c("Residuals vs Fitted", "Normal Q-Q plot",
                 "Scale-Location plot", "Cook's distance plot"),
     panel = points, sub.caption = deparse(x$call), main = "",
     ask = prod(par("mfcol")) < length(which) && dev.interactive(),
     ..., id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75)

where (see further down):

id.n: number of points to be labelled in each plot, starting with the most extreme.
and note, in the default parameter-values listing above: id.n = 3. Hence, the 3 most extreme points (according to the criterion being plotted in each plot) are marked in each plot. So, for instance, try plot(ll, id.n=5) and you will get points 10, 15, 25, 28, 50. And so on. But that pre-supposes that you know how many points are exceptional. What is meant by "extreme" is not stated in the help page ?plot.lm, but can be identified by inspecting the code for plot.lm(), which you can see by entering

plot.lm

In your example, if you omit the line which assigns anomalous values to err[15], err[25] and err[50], then you are likely to observe that different points get identified on different plots. For instance, I just got the following results for the default id.n=3:

[1] Residuals vs Fitted:        41, 53, 59
[2] Standardised Residuals:     41, 53, 59
[3] sqrt(Stand Res) vs Fitted:  41, 53, 59
[4] Cook's Distance:            59, 96, 97

There are several approaches (with somewhat different outcomes) to identifying outliers. If you apply one of these, you will probably get the identities of the points anyway. Again in the context of your example (where in fact you deliberately set 3 points to have exceptional errors, thus
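For the record, a sketch of the comparison suggested at the top of this thread, re-using Ted's simulated data (rlm() is in MASS; the weight cut-off of 0.5 is an arbitrary illustrative choice):

library(MASS)
set.seed(42)
err <- rnorm(100)
err[15] <- 5; err[25] <- -4; err[50] <- 10
x <- 1:100
y <- 0.3 + 0.4 * x + 0.5 * err
ll.rlm <- rlm(y ~ x)
which(ll.rlm$w < 0.5)   # points down-weighted by the resistant fit
plot(ll.rlm, id.n = 5)  # rlm inherits from lm, so plot.lm applies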
[R] [R-pkgs] New version of lme4 and new mailing list R-SIG-mixed-models
Dear Douglas -

I have tried several fits that relate to ch 10 of the 2nd edition of Maindonald & Braun. In two cases a fit that does not currently converge with lmer does now converge with lmer2. In one case where I took logs in order to get convergence (the data did not really demand it), I can now get away without taking logs. Basically, none of the problems that I had to work around when I did the calculations for that chapter have surfaced. An exercise that had been problematic now converges without apparent problem. In one case CPU time was reduced by a factor of around 3. In one case where I formerly could not get convergence, the reduction (to termination of the calculations) was by a factor of around 10. The results seem similar, with a small increase in the log-likelihood (perhaps ~0.5; I have not checked very carefully) in some cases. If you want more detail, I will provide it. It looks a huge improvement. I am sure that R users will be duly grateful for your efforts.

Regards
John Maindonald.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

sessionInfo()
R version 2.4.1 Patched (2007-01-17 r40518)
i386-apple-darwin8.8.1

locale: C

attached base packages:
[1] stats graphics grDevices datasets utils methods
[7] base

other attached packages:
     DAAG      MASS      lme4    Matrix   lattice
     0.91    7.2-31 0.9975-11  0.9975-8   0.14-16
Re: [R] overdispersion
I would say rather that for binary data (binomial data with n = 1) it is not possible to detect overdispersion from examination of the Pearson chi-square or the deviance. Overdispersion may be, and often is, nevertheless present. I am arguing that overdispersion is properly regarded as a function of the variance-covariance structure, not as a function of the sample data. The variance of a two-point distribution is a known function of the mean, provided that independence and identity of distribution can be assumed, or provided that the correlation structure is otherwise known and the mean is constant. That proviso is crucial!

If there is some sort of grouping, it may be appropriate to aggregate data over the groups, yielding data that have a binomial form with n > 1. Overdispersion can now be detected from the Pearson chi-square or from the deviance. Note that the quasi models assume that the multiplier for the binomial or other variance is constant with p; that may or may not be realistic. Generalized linear mixed models make their own, different, assumptions about how the variance changes as a function of p; again these may or may not be realistic. It is then the error structure that is crucial. To the extent that it distracts from careful thinking about that structure, the term "overdispersion" is unsatisfactory.

There's no obvious way that I can see to supply glm() with an estimate of the dispersion that has been derived independently of the current analysis. Especially in the binary case, this would sometimes be useful.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 12 Jan 2007, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Peter Dalgaard [EMAIL PROTECTED]
Date: 12 January 2007 5:04:26 AM
To: evaiannario [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] overdispersion

evaiannario wrote:

How can I eliminate the overdispersion for binary data apart from the use of the quasibinomial?

There is no such thing as overdispersion for binary data. (The variance of a two-point distribution is a known function of the mean.) If what you want to do is include random effects of some sort of grouping then you might look into generalized linear mixed models via lmer() from the lme4 package or glmmPQL from MASS.
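A sketch of the aggregation idea, with invented clustered data (40 groups of 10 binary responses, with genuine group-to-group variation in p):

set.seed(1)
group <- factor(rep(1:40, each = 10))
p <- plogis(rnorm(40, 0, 1))        # group-level probabilities vary
ybin <- rbinom(400, 1, p[group])    # binary (n = 1) responses
succ <- tapply(ybin, group, sum)    # aggregate to binomial form, n = 10
fit <- glm(cbind(succ, 10 - succ) ~ 1, family = quasibinomial)
summary(fit)$dispersion             # now detectably above 1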
[R] unexpected result in glm (family=poisson) for data with an only zero response in one factor
The Wald statistics that are returned as "z value" can be a very rough approximation. The standard error is radically different, on a logarithmic scale, between log(mu) = -20.30 [the best glm() managed in approximating minus infinity] and log(mu) + log(a) = -0.29. It is actually worse than might appear; the SE = 2457.38 is an approximation to infinity! The phenomenon is an extreme version, now with a Poisson error model, of the Hauck-Donner effect (Modern Applied Statistics with S, 4th edn, pp. 197-199) that occurs with binomial data. Use the result from the anova likelihood ratio test, where the approximations that are involved are usually much better behaved (but it would not hurt to do a simulation as a check). There is an example of this phenomenon with a Poisson error model in Subsection 8.4.2 (the same subsection number both for the 1st edn and the forthcoming 2nd edn) of Data Analysis & Graphics Using R, CUP, 2003 and 2006. Install and attach the DAAG package and try example(moths).

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

Dear members, here is my trouble: My data consist of counts of trapped insects in different attractive traps. I usually use GLMs with a poisson error distribution to find out the differences between my treatments (and to look at other factor effects). But for some datasets where one treatment contains only zeros, a GLM with poisson family fails to find any difference between this particular treatment and any other one (even a treatment that has trapped a lot of insects). GLMs with gaussian family do not seem to have the same problem, but GLMs with binomial family do. I'm not sure if it is a statistical problem or if it comes from R ... in the latter case I think some solution exists (perhaps in the options of the glm() function?). Thank you for your help. Here is an example to paste into the console:

## START ##
# Take a data set of counts for two treatments, one containing only zeros
A = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
B = c(1,0,0,0,2,1,0,0,1,2,0,0,0,1,2,2,0,1,1,0,
      1,0,2,1,1,0,1,2,0,1,0,1,1,1,0,1,1,1,0,1)
traitment = c(rep("A",40), rep("B",40))
response = c(A,B)
mydata = data.frame(traitment, response)
# Make a GLM on this dataset, with family=poisson
g = glm(response ~ traitment, data=mydata, family=poisson)
anova.glm(g, test="Chisq")
# There is an effect of the traitment ...
summary(g)
# But traitment A does not differ from traitment B !!!
# (the p-value is always close to 1 in such cases)
# Now if you change only one zero of the A response to 1, the GLM works properly:
mydata[1,2] = 1
g = glm(response ~ traitment, data=mydata, family=poisson)
anova.glm(g, test="Chisq")
summary(g)
## END ##

Antonin Ferry (PhD)
Laboratoire d'Ecobiologie des Insectes Parasitoides
http://www.parasitoides.univ-rennes1.fr
Université de Rennes 1, FRANCE
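A minimal sketch of the binomial version of the Hauck-Donner effect, with invented data: as the separation between two groups grows, the Wald z statistic eventually collapses, even though the evidence is getting stronger.

x <- rep(0:1, each = 20)
for (k in c(15, 19, 20)) {
  ## group 0: 5/20 successes; group 1: k/20 successes
  y <- c(rep(1, 5), rep(0, 15), rep(1, k), rep(0, 20 - k))
  fit <- glm(y ~ x, family = binomial)
  print(summary(fit)$coefficients["x", c("z value", "Pr(>|z|)")])
}
## At k = 20 the fitted proportion is 1, the MLE is off at infinity,
## and the Wald z drops towards 0; the likelihood ratio test,
## anova(fit, test = "Chisq"), is not fooled.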
[R] unexpected result in glm (family=poisson) for data with an only zero response in one factor
PS. In part, the problem is with the use of the log link, arising because the limit of log(mu), as mu goes to 0, is minus infinity. This is not an appropriate scale on which to represent a fitted value that is zero. The estimated SE for a fitted value of zero should be 0. You will get a more sensible answer if you set family=poisson(link=sqrt):

g <- glm(response ~ traitment, data=mydata, family=poisson(link=sqrt))
summary(g)

Call:
glm(formula = response ~ traitment, family = poisson(link = sqrt),
    data = mydata)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.225e+00  -2.730e-05  -2.730e-05   2.745e-01   1.193e+00

Coefficients:
             Estimate Std. Error   z value Pr(>|z|)
(Intercept) 1.930e-05  0.0790569 0.0002441   0.9998
traitmentB  0.8660061  0.1118034     7.746  9.5e-15

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.485  on 79  degrees of freedom
Residual deviance: 33.896  on 78  degrees of freedom
AIC: 89.579

Number of Fisher Scoring iterations: 14

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
Re: [R] Conservative ANOVA tables in lmer
A Wiki entry is an excellent idea. I am happy to try to help. An account of mcmcsamp() might be a very useful part of the Wiki. My limited investigations suggest that once the data start to overwhelm the prior (maybe ~3 df for an effect that is of interest), the posterior distribution that it gives provides a very good approximation to the sampling distribution. I have been meaning to put aside time to try to work out, with the help of a colleague here at ANU, how the Kenward & Roger (Biometrics, 1997) approximation might be implemented in lmer, but it hasn't yet happened and is unlikely to do so for a while.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 10 Sep 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Spencer Graves [EMAIL PROTECTED]
Date: 10 September 2006 4:54:50 PM
To: Andrew Robinson [EMAIL PROTECTED]
Cc: Douglas Bates [EMAIL PROTECTED], R-Help [EMAIL PROTECTED]
Subject: Re: [R] Conservative ANOVA tables in lmer

Hi, Doug, et al.: I'll volunteer to do the same, which is an extension of much of what I've been doing for R Help for a while now. Regarding writing a FAQ, what about a Wiki entry (and maybe ultimately a vignette)? This thread provides notes around which such could be built. Another piece might be an example from Scheffé (1958), which I sent as a reply to an earlier comment on this thread (foolishly sent without reducing the cc list, which means it awaits moderator approval). Each time a question of this nature arises, someone checks the Wiki, edits/adds something to it if necessary, then replies to the list with the reference to the appropriate Wiki entry. Spencer Graves

Andrew Robinson wrote:

On Thu, Sep 07, 2006 at 07:59:58AM -0500, Douglas Bates wrote:

I would be happy to re-institute p-values for fixed effects in the summary and anova methods for lmer objects, using a denominator degrees of freedom based on the trace of the hat matrix or the rank of Z:X, if others will volunteer to respond to the "these answers are obviously wrong because they don't agree with <whatever>" and "the idiot who wrote this software should be thrashed to within an inch of his life" messages. I don't have the patience.

This seems to be more than fair to me. I'll volunteer to help explain why the anova.lmer() output doesn't match SAS, etc. Is it worth putting a caveat in the output and the help files? Is it even worth writing a FAQ about this? Cheers Andrew
[R] Robustness of linear mixed models
I'd use mcmcsamp() to examine the posterior distribution, under a relatively uninformative prior, of the parameter estimates. For estimates that are based on four or five or more degrees of freedom, I'd surmise that the prior will not matter too much. With estimates where the number of degrees of freedom is one or two or three, the posterior distribution may vary greatly from one run of mcmcsamp() to another. Of course, the definition of degrees of freedom becomes quite fraught if there is severe imbalance; e.g., if some of the items that contribute to the estimate are based on much more data than others.

Subject to such caveats as just noted, I'd expect that credible intervals derived from the posterior distributions would be close to the usual frequentist confidence intervals. The main effect of the non-normality may be that the estimates are inefficient, i.e., the variance may be larger, or the distributions more dispersed, than for true maximum likelihood estimates, were you able to obtain them!

John Maindonald

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 27 Jun 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Bruno L. Giordano [EMAIL PROTECTED]
Date: 27 June 2006 11:21:25 AM
To: r-help@stat.math.ethz.ch
Subject: [R] Robustness of linear mixed models

Hello, with 4 different linear mixed models (continuous dependent) I find that my residuals do not follow the normality assumption (significant Shapiro-Wilk with values equal to or higher than 0.976; sample sizes 750 or 1200). I find, instead, that my residuals are really well fitted by a t distribution with df ranging, in the different datasets, from 5 to 12. Should this be considered such a severe violation of the normality assumption as to make model-based inferences invalid? Thanks a lot, Bruno
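A sketch of a graphical check along Bruno's lines (names invented; 'fit' stands for the fitted mixed model, and fitdistr() from MASS estimates the t parameters):

library(MASS)
res <- resid(fit)
tpars <- fitdistr(res, "t")$estimate    # location m, scale s, df
r <- (res - tpars["m"]) / tpars["s"]    # centre and rescale
qqplot(qt(ppoints(length(r)), df = tpars["df"]), r,
       xlab = "t quantiles", ylab = "scaled residuals")
abline(0, 1)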
Re: [R] regression modeling
An interesting example of this is the forest cover dataset that is available from http://www.ics.uci.edu/~mlearn. The proportions of the different cover types change systematically as one moves through the file. It seems that distance through the file is a proxy for the geographical co-ordinates. Fitting a tree-based or suchlike model to the total data is not the way to go, unless one is going to model the pattern of change through the file as part of the modeling exercise. In any case, some preliminary exploration of the data is called for, so that such matters come to attention. For my money, the issue is not ease of performing regression with huge data sets, but ease of data exploration.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 26 Apr 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Berton Gunter [EMAIL PROTECTED]
Date: 26 April 2006 6:47:12 AM
To: 'Weiwei Shi' [EMAIL PROTECTED], 'bogdan romocea' [EMAIL PROTECTED]
Cc: 'r-help' R-help@stat.math.ethz.ch
Subject: Re: [R] regression modeling

May I offer a perhaps contrary perspective on this. Statistical **theory** tells us that the precision of estimates improves as sample size increases. However, in practice, this is not always the case. The reason is that it can take time to collect that extra data, and things change over time. So the very definition of what one is measuring, the measurement technology by which it is measured (think about estimating tumor size or disease incidence or underemployment, for example), the presence or absence of known or unknown large systematic effects, and so forth may change in unknown ways. This defeats, or at least complicates, the fundamental assumption that one is sampling from a (fixed) population or stable (e.g. homogeneous, stationary) process, so it's no wonder that all statistical bets are off. Of course, sometimes the necessary information to account for these issues is present, and appropriate (but often complex) statistical analyses can be performed. But not always.

Thus, I am suspicious, cynical even, about those who advocate collecting all the data and subjecting the whole vast heterogeneous mess to arcane and ever more computer-intensive (and adjustable-parameter-ridden) data mining algorithms to detect trends or "discover knowledge". To me, it sounds like a prescription for turning on all the equipment and waiting to see what happens in the science lab instead of performing careful, well-designed experiments. I realize, of course, that there are many perfectly legitimate areas of scientific research, from geophysics to evolutionary biology to sociology, where one cannot (easily) perform planned experiments. But my point is that good science demands that in all circumstances, and especially when one accumulates and attempts to aggregate data taken over spans of time and space, one needs to beware of oversimplification, including statistical oversimplification. So interrogate the measurement, be skeptical of stability, expect inconsistency. While "all models are wrong but some are useful" (George Box), the second law tells us that entropy still rules. (Needless to say, public or private contrary views are welcome.)

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P.
Box
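A sketch of the kind of exploration meant above, assuming the covtype data have been downloaded from the UCI site as a comma-delimited file with the cover type in the final column (the file has no header, so column positions are assumed):

covtype <- read.csv("covtype.data", header = FALSE)
cover <- factor(covtype[[ncol(covtype)]])
slab <- cut(seq_along(cover), breaks = 20)  # 20 successive slabs of the file
round(prop.table(table(slab, cover), margin = 1), 2)
# Proportions that trend across slabs show that position in the file matters.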
Re: [R] Optimal platform for R
I've found memory management sometimes problematic under Windows. I have a calculation that runs without difficulty in 512MB under Mac OS X (10.3 or 10.4); I'd expect the same under Linux. Under Windows (XP Professional) with 512MB, it requires a freshly booted system. But maybe the new machines will have so much memory that memory management will not be an issue.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 10 Mar 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Prof Brian Ripley [EMAIL PROTECTED]
Date: 10 March 2006 6:50:03 PM
To: [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Optimal platform for R

On Thu, 9 Mar 2006, [EMAIL PROTECTED] wrote:

I'm looking to buy a new desktop which will primarily be used for analyses of large datasets (100s of MB). I've seen postings from several years back re the 'optimal' platform for running R, but nothing more recently.

It is a subject which comes up every few months. Many of the developers are running dual (or dual-core) Opterons/Athlon 64s under Linux these days.

Specifically, I want to know: 1) if I run R under Windows, does having a dual-processor machine help speed things up? And 2) is it still true that R performs about as well under Windows as Linux?

Duncan Murdoch has already mentioned the 64-bit advantage if you need large datasets, but there is also a speed penalty if you do not. Your description seems on the margins (depends how many 100s and what the format is and what you want to do). One advantage of AMD64 Linux is that I can run either 32- or 64-bit versions of R and choose to have speed or space for any given task. A dual processor will be of little help in running R faster. R's interpreter is single-threaded, and although you can get some advantage in using multi-threaded BLAS libraries in large matrix computations, these are not readily available for R under Windows, and the advantage is often small under Linux. Running two or more instances of R will take advantage of dual processors, and I have been running dual-CPU machines for a decade. As for Windows vs Linux, R runs on the same hardware at about the same speed when comparing the standard Windows build with a shared library version on Linux (standard for e.g. the RH RPMs), but the standard Linux build is 10-20% faster. For one set of comparisons see http://sekhon.berkeley.edu/macosx/

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
Re: [R] repeated measures ANOVA
There seem to be several issues here:

1) In the analysis that has a (1|Subject) error term, there is a large negative correlation between the parameter estimates for time and time:group. Overall, the effect of time is significant, as can be seen from

time.lme <- lme(p.pa ~ time * group, random = ~ 1 | subject, method="ML")
notime.lme <- lme(p.pa ~ group, random = ~ 1 | subject, method="ML")
anova(time.lme, notime.lme)

           Model df   AIC   BIC logLik   Test L.Ratio p-value
time.lme       1  6 245.0 253.4 -116.5
notime.lme     2  4 254.0 259.6 -123.0 1 vs 2   12.95  0.0015

What is uncertain is how this time effect should be divided up, between a main effect of slope and the interaction.

2) What the interaction plot makes clear, and what the change in treatment (for group 1 only?) at time point 3 should have suggested, is that the above analysis is not really appropriate. There are two comparisons: (i) at time points 1 and 2; and (ii) at time point 3.

3) The above does not allow for a random group-to-group change in slope, additional to the change that can be expected from random variation about the line. Models 3 and 4 in your account do this, and allow also for group:subject and group:time random effects that make matters more complicated still. The fitting of such a model has the consequence that between-group differences in slope are entirely explained by this random effect. Contrary to what the lmer() output might suggest, no degrees of freedom are left with which to estimate the time:group interaction. (Or you can estimate the interaction, and no degrees of freedom are left for either the time or time:group random effect.) All you can talk about is the average and the difference of the time effects for these two specific groups.

Thus, following on from (3), I do not understand how lmer() is able to calculate a t-statistic. There seems to me to be double dipping. Certainly, I noted a convergence problem.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 28 Feb 2006, at 10:00 PM, Christian Gold wrote:

From: Christian Gold [EMAIL PROTECTED]
Date: 28 February 2006 2:15:04 AM
To: r-help@stat.math.ethz.ch
Subject: [R] repeated measures ANOVA

Dear list members: I have the following data:

group <- rep(rep(1:2, c(5,5)), 3)
time <- rep(1:3, rep(10,3))
subject <- rep(1:10, 3)
p.pa <- c(92, 44, 49, 52, 41, 34, 32, 65, 47, 58, 94, 82, 48, 60, 47,
          46, 41, 73, 60, 69, 95, 53, 44, 66, 62, 46, 53, 73, 84, 79)
P.PA <- data.frame(subject, group, time, p.pa)

The ten subjects were randomly assigned to one of two groups and measured three times. (The treatment changes after the second time point.) Now I am trying to find out the most adequate way for an analysis of main effects and interaction. Most social scientists would call this analysis a repeated measures ANOVA, but I understand that mixed-effects model is a more generic term for the same analysis. I did the analysis in four ways (one in SPSS, three in R):

1. In SPSS I used general linear model, repeated measures, defining a within-subject factor for the three different time points. (The data frame is structured differently in SPSS so that there is one line for each subject, and each time point is a separate variable.) Time was significant.

2.
Analogous to what is recommended in the first chapter of Pinheiro & Bates' Mixed-Effects Models book, I used

library(nlme)
summary(lme(p.pa ~ time * group, random = ~ 1 | subject))

Here, time was NOT significant. This was surprising not only in comparison with the result in SPSS, but also when looking at the graph:

interaction.plot(time, group, p.pa)

3. I then tried a code for the lme4 package, as described by Douglas Bates in R News 5(1), 2005 (pp. 27-30). The result was the same as in 2.

library(lme4)
summary(lmer(p.pa ~ time * group + (time * group | subject), P.PA))

4. Then I also tried what Jonathan Baron suggests in his "Notes on the use of R for psychology experiments and questionnaires" (on CRAN):

summary(aov(p.pa ~ time * group + Error(subject/(time * group))))

This gives me yet another result. So I am confused. Which one should I use? Thanks Christian
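A sketch (not part of the original thread) of one way to set up the two comparisons identified in point (2) of the reply above, treating times 1-2 and time 3 as separate phases:

library(nlme)
P.PA$phase <- factor(ifelse(P.PA$time <= 2, "t1-2", "t3"))
P.PA$group <- factor(P.PA$group)
phase.lme <- lme(p.pa ~ phase * group, random = ~ 1 | subject,
                 data = P.PA, method = "ML")
summary(phase.lme)   # group contrasts within each phase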
Re: [R] repeated measures ANOVA
There was a mistake in my earlier note that I should correct:

(Or you can estimate the interaction, and no degrees of freedom are left for either the time or time:group random effect.) All you can talk about is the average and the difference of the time effects for these two specific groups.

There is no problem with the time random effect; that can be estimated from the within-group variation in slopes, between subjects.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
[R] In which application areas is R used?
If anyone has a list of application areas where there is extensive use of R, I'd like to hear of it. My current short list is:

  Bioinformatics
  Epidemiology
  Geophysics
  Agriculture and crop science

John Maindonald
Mathematical Sciences Institute, Australian National University.
[EMAIL PROTECTED]
Re: [R] In which application areas is R used?
In this context "extensive" might be use of R in at least maybe 2% or 5% of the published analyses in the area, enough to make waves and stir awareness. The immediate subtext is the demand of a book publisher for a list of journals to which a new edition of a certain book might be sent for review, and for a list of conferences where it might be given exposure. For myself, in the medium to longer term, I am more interested in other subtexts such as you mention, to which the answer might have relevance.

I've wondered what support there'd be for starting a database of bibliographic information on papers where R was used for the analysis. Authors might supply the information, or readers of a paper suggest its addition to the database. Once well populated, this would provide a useful indication of the range of application areas and journals where R is finding use. [Or has someone, somewhere, already started such a database?]

Finance and biostatistics are obvious areas that I'd omitted. Other areas drawn to my attention have been telephony and electronic networks, solid state etc. manufacturing, computer system performance, oceanography and fisheries research, risk analysis, process engineering and marketing. (I hope my summaries are acceptably accurate.) I'm not sure what force these other respondents have given the word "extensive".

John Maindonald
Mathematical Sciences Institute
Australian National University.
[EMAIL PROTECTED]

Berton Gunter wrote:

Define "extensive". I think your answers depend on your definition. I know a bunch of folks in pharmaceutical preclinical R&D who use R for all sorts of stuff (analysis and visualization of tox and efficacy animal studies, dose/response modeling, PK work, IC50 determination, stability data analysis, etc.). Is "bunch" a majority? I strongly doubt that it's near. Is it 5%, 10%, 30%?? Dunno. Excel is still the Big Boy in most of these arenas I would bet. But I would also bet that there are at least 1 or 2 folks in dozens of companies who use R for these things. Is there a subtext to your query? -- i.e. are you trying to make an argument for something? -- Bert

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Re: [R] Splitting the list
I've changed the heading because this really is another thread.

I think it inevitable that there will, in the course of time, be other lists that are devoted, in some shape or form, to the concerns of practitioners (at all levels) who are using R. One development I'd not like to see is fracture along application area lines, allowing those who are comfortable in coteries whose focus was somewhat relevant to standards of use of statistics in that area 15 or 20 years ago to continue that way. One of the great things about R, in its development to date, has been its role in exposing people from a variety of application area communities to statistical traditions different from those in which they have been nurtured. I expect it to have a continuing role in raising statistical analysis standards, in raising the bar.

Another possibility is fracture along geographic boundaries. This has both benefits (one being that it is easier, within a smaller circle of people who are more likely to know each other, for contributors to establish a rapport that will make the list really effective; also there will be notices and discussion that are of local interest) and drawbacks (it risks separating subscribers off from important discussions on the official R lists). On balance, this may be the better way to go. Indeed, subscribers to ANZSTAT (the Australian and NZ statistical list) will know that an R-downunder list, hosted at Auckland, is currently in test-drive mode. There should be enough subscribers in common between this and the official R lists that the south-eastern portion of Gondwana does not, at any time in the very near future, float off totally on its own.

There are of course other possibilities, and it may be useful to canvass them.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 4 Jan 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Ben Fairbank [EMAIL PROTECTED]
Date: 4 January 2006 4:42:31 AM
To: R-help@stat.math.ethz.ch
Subject: Re: [R] A comment about R:

One implicit point in Kjetil's message is the difficulty of learning enough of R to make its use a natural and desired first-choice alternative, which I see as the point at which real progress and learning commence with any new language. I agree that the long learning curve is a serious problem, and in the past I have discussed, off list, with one of the very senior contributors to this list the possibility of splitting the list into sections for newcomers and for advanced users. He gave some very cogent reasons for not splitting, such as the possibility of newcomers' getting bad advice from others only slightly more advanced than themselves. And yet I suspect that a newcomers' section would encourage the kind of mutually helpful collegiality among newcomers that now characterizes the exchanges of the more experienced users on this list. I know that I have occasionally been reluctant to post issues that seem too elementary or trivial to vex the others on the list with, and so have stumbled around for an hour or so seeking the solution to a simple problem. Had I the counsel of others similarly situated, progress might have been far faster. Have other newcomers or occasional users had the same experience? Is it time to reconsider splitting this list into two sections? Certainly the volume of traffic could justify it.
Ben Fairbank
Re: [R] lme X lmer results
From a quick look at the paper in the SAS proceedings, the simulations seem limited to nested designs. The major problems are with repeated measures designs where the error structure is not compound symmetric, which lme4 does not at present handle (unless I have missed something). Such imbalance as was investigated was not a serious issue, at least for the Kenward and Roger degree of freedom calculations. The paper ends by commenting that research should continue.

What may be even more important is to educate users to think carefully about any df that they are presented with, and to be especially sceptical when designs are not approximately balanced nested designs and/or there are repeated measures error structures that are not compound symmetric. It is also necessary to consider how well the analysis reflects matters on which there may be existing good evidence. Suppose in Ronaldo's case that he'd previously run a number of experiments with very similar plots and observational units, and with comparable treatments and outcome measures. If the Plot1 SD estimate (i.e., at the level of experimental units) had never been larger than 0.01, with the SD for observational units always in a range of 2 to 20, I'd take this as licence to ignore the variance at the Plot1 level. It would be nice to be able to build in such prior information more formally, probably via a modified version of mcmcsamp(). [Some people are never satisfied: you've written a great piece of software, and users reward you by complaining that they want even more!]

John Maindonald.

On 1 Jan 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Dave Atkins [EMAIL PROTECTED]
Date: 1 January 2006 1:40:45 AM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

Message: 18
Date: Fri, 30 Dec 2005 12:51:59 -0600
From: Douglas Bates [EMAIL PROTECTED]
Subject: Re: [R] lme X lmer results
To: John Maindonald [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch

On 12/29/05, John Maindonald [EMAIL PROTECTED] wrote:

Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

Doug Bates commented: I don't want to be obtuse and argumentative but I still am not convinced that there is a correct denominator degrees of freedom for _this_ F statistic. I may be wrong about this but I think you are referring to an F statistic based on a denominator from a different error stratum, which is not what is being quoted. (Those are not given because they don't generalize to unbalanced designs.) This is why I would like to see someone undertake a simulation study to compare various approaches to inference for the fixed effects terms in a mixed model, using realistic (i.e. unbalanced) examples.
Doug --

Here is a paper that focused on the various alternatives to denominator degrees of freedom in SAS and does report some simulation results:

http://www2.sas.com/proceedings/sugi26/p262-26.pdf

Not sure whether it argues convincingly one way or the other in the present discussion. cheers, Dave

--
Dave Atkins, PhD
[EMAIL PROTECTED]
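For a balanced nested design of the kind discussed in this thread, one way to "work out df for yourself" is to fit with aov() and an Error() term, which displays the error strata and their df explicitly. A sketch with invented data (2 treatments, 4 plots per treatment, 10 observations per plot):

set.seed(7)
dat <- expand.grid(rep = 1:10, plot = 1:4, trt = c("a", "b"))
dat$plot <- factor(paste(dat$trt, dat$plot))   # plots nested within trt
dat$y <- rnorm(nrow(dat)) + rep(rnorm(8), each = 10)
summary(aov(y ~ trt + Error(plot), data = dat))
# trt is tested in the 'plot' stratum: 1 and 6 df, not 1 and 78.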
Re: [R] lme X lmer results
contribute equally to sd1^2, and sd1^2 usually tracks quite closely to a chi-squared distribution.

John Maindonald.

On 31 Dec 2005, at 5:51 AM, Douglas Bates wrote:

On 12/29/05, John Maindonald [EMAIL PROTECTED] wrote:

Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

I don't want to be obtuse and argumentative but I still am not convinced that there is a correct denominator degrees of freedom for _this_ F statistic. I may be wrong about this but I think you are referring to an F statistic based on a denominator from a different error stratum, which is not what is being quoted. (Those are not given because they don't generalize to unbalanced designs.) This is why I would like to see someone undertake a simulation study to compare various approaches to inference for the fixed effects terms in a mixed model, using realistic (i.e. unbalanced) examples.

It seems peculiar to me that the F statistics are being created from the ratios of mean squares for different terms to the _same_ mean square (actually a penalized sum of squares divided by the degrees of freedom) and the adjustment suggested to take into account the presence of the random effects is to change the denominator degrees of freedom. I think the rationale for this is an attempt to generalize another approach (the use of error strata) even though it is not being used here.

Rather than giving df that for the comparison(s) of interest may be highly inflated, I'd prefer to give no degrees of freedom at all, to encourage users to work out df for themselves if at all possible. If they are not able to do this, then mcmcsamp() is a good alternative, and may be the way to go in any case. This has the further advantage of allowing assessments in cases where the relevant distribution is hard to get at. I'd think a warning in order that the df are upper bounds, and may be grossly inflated.

As I said, I am willing to change this if it is shown to be grossly inaccurate but please show me.

Incidentally, does mcmcsamp() do its calculations pretty well independently of the lmer results?

mcmcsamp starts from the parameter estimates when creating the chain but that is the extent to which it depends on the lmer results.

John Maindonald.

On 29 Dec 2005, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Douglas Bates [EMAIL PROTECTED]
Date: 29 December 2005 5:59:07 AM
To: Ronaldo Reis-Jr. [EMAIL PROTECTED]
Cc: R-Help r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

On 12/26/05, Ronaldo Reis-Jr. [EMAIL PROTECTED] wrote:

Hi, this is not a new doubt, but is a doubt that I can't find a good response to.
Look at this output:

m.lme <- lme(Yvar ~ Xvar, random = ~1 | Plot1/Plot2/Plot3)
anova(m.lme)

            numDF denDF  F-value p-value
(Intercept)     1   860 210.2457  <.0001
Xvar            1     2   1.2352  0.3821

summary(m.lme)

Linear mixed-effects model fit by REML
Data: NULL
     AIC      BIC    logLik
 5416.59 5445.256 -2702.295

Random effects:
 Formula: ~1 | Plot1
        (Intercept)
StdDev: 0.000745924

 Formula: ~1 | Plot2 %in% Plot1
        (Intercept)
StdDev: 0.000158718

 Formula: ~1 | Plot3 %in% Plot2 %in% Plot1
        (Intercept) Residual
StdDev: 0.000196583 5.216954

Fixed effects: Yvar ~ Xvar
                Value Std.Error  DF  t-value p-value
(Intercept) 2.3545454 0.2487091 860 9.467066  0.0000
XvarFactor2 0.3909091 0.3517278   2 1.111397  0.3821

Number of Observations: 880
Number of Groups:
                      Plot1            Plot2 %in% Plot1
                          4                           8
Plot3 %in% Plot2 %in% Plot1
                         20

This is the correct result, the correct denDF for Xvar. I make this using lmer:

m.lmer <- lmer(Yvar ~ Xvar + (1|Plot1) + (1|Plot1:Plot2) + (1|Plot3))
anova(m.lmer)

Analysis of Variance Table
     Df Sum Sq Mean Sq  Denom F value Pr(>F)
Xvar  1  33.62   33.62 878.00  1.2352 0.2667

summary(m.lmer)

Linear mixed-effects model fit by REML
Formula: Yvar ~ Xvar + (1 | Plot1) + (1 | Plot1:Plot2) + (1 | Plot3)
     AIC     BIC    logLik MLdeviance REMLdeviance
 5416.59 5445.27 -2702.295   5402.698      5404.59

Random effects:
 Groups      Name        Variance   Std.Dev.
 Plot3       (Intercept) 1.3608e-08 0.00011665
 Plot1:Plot2 (Intercept) 1.3608e-08 0.00011665
 Plot1       (Intercept) 1.3608e-08 0.00011665
[R] lme X lmer results
Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

Rather than giving df that for the comparison(s) of interest may be highly inflated, I'd prefer to give no degrees of freedom at all, to encourage users to work out df for themselves if at all possible. If they are not able to do this, then mcmcsamp() is a good alternative, and may be the way to go in any case. This has the further advantage of allowing assessments in cases where the relevant distribution is hard to get at. I'd think a warning in order that the df are upper bounds, and may be grossly inflated.

Incidentally, does mcmcsamp() do its calculations pretty well independently of the lmer results?

John Maindonald.

On 29 Dec 2005, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Douglas Bates [EMAIL PROTECTED]
Date: 29 December 2005 5:59:07 AM
To: Ronaldo Reis-Jr. [EMAIL PROTECTED]
Cc: R-Help r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

On 12/26/05, Ronaldo Reis-Jr. [EMAIL PROTECTED] wrote:

Hi, this is not a new doubt, but is a doubt that I can't find a good response to. Look at this output:

m.lme <- lme(Yvar ~ Xvar, random = ~1 | Plot1/Plot2/Plot3)
anova(m.lme)

            numDF denDF  F-value p-value
(Intercept)     1   860 210.2457  <.0001
Xvar            1     2   1.2352  0.3821

summary(m.lme)

Linear mixed-effects model fit by REML
Data: NULL
     AIC      BIC    logLik
 5416.59 5445.256 -2702.295

Random effects:
 Formula: ~1 | Plot1
        (Intercept)
StdDev: 0.000745924

 Formula: ~1 | Plot2 %in% Plot1
        (Intercept)
StdDev: 0.000158718

 Formula: ~1 | Plot3 %in% Plot2 %in% Plot1
        (Intercept) Residual
StdDev: 0.000196583 5.216954

Fixed effects: Yvar ~ Xvar
                Value Std.Error  DF  t-value p-value
(Intercept) 2.3545454 0.2487091 860 9.467066  0.0000
XvarFactor2 0.3909091 0.3517278   2 1.111397  0.3821

Number of Observations: 880
Number of Groups:
                      Plot1            Plot2 %in% Plot1
                          4                           8
Plot3 %in% Plot2 %in% Plot1
                         20

This is the correct result, the correct denDF for Xvar. I make this using lmer:

m.lmer <- lmer(Yvar ~ Xvar + (1|Plot1) + (1|Plot1:Plot2) + (1|Plot3))
anova(m.lmer)

Analysis of Variance Table
     Df Sum Sq Mean Sq  Denom F value Pr(>F)
Xvar  1  33.62   33.62 878.00  1.2352 0.2667

summary(m.lmer)

Linear mixed-effects model fit by REML
Formula: Yvar ~ Xvar + (1 | Plot1) + (1 | Plot1:Plot2) + (1 | Plot3)
     AIC     BIC    logLik MLdeviance REMLdeviance
 5416.59 5445.27 -2702.295   5402.698      5404.59

Random effects:
 Groups      Name        Variance   Std.Dev.
 Plot3       (Intercept) 1.3608e-08 0.00011665
 Plot1:Plot2 (Intercept) 1.3608e-08 0.00011665
 Plot1       (Intercept) 1.3608e-08 0.00011665
 Residual                2.7217e+01 5.21695390
# of obs: 880, groups: Plot3, 20; Plot1:Plot2, 8; Plot1, 4

Fixed effects:
            Estimate Std. Error  DF t value Pr(>|t|)
(Intercept)  2.35455    0.24871 878  9.4671   <2e-16 ***
XvarFactor2  0.39091    0.35173 878  1.1114   0.2667
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Look at the wrong P value; I know that it is wrong because of the DF used. But in this case the result is not correct. There is no difference between the result using random effects with lmer and using a simple analysis with lm.
You are assuming that there is a correct value of the denominator degrees of freedom. I don't believe there is. The statistic that is quoted there doesn't have exactly an F distribution, so there is no correct number of degrees of freedom. One thing you can do with lmer is to form a Markov Chain Monte Carlo sample from the posterior distribution of the parameters, so you can check whether the value of zero is in the middle of the distribution of XvarFactor2 or not. It would be possible for me to recreate in lmer the rules used in lme for calculating denominator degrees of freedom associated with terms in the random effects. However, the class of models fit by lmer is larger than the class of models fit by lme (at least as far as the structure of the random-effects terms goes). In particular, lmer allows for random effects associated with crossed or partially crossed grouping factors, and the rules for denominator degrees of freedom in lme only
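For a feel for how much the choice of denominator df matters here, the two-sided p-values for the hypothetical t-statistic of 3.0 mentioned above can be checked directly (base R only; a quick illustration, not part of the original thread):

2 * pt(3.0, df = 2, lower.tail = FALSE)    # ~0.095, on 2 df
2 * pt(3.0, df = 878, lower.tail = FALSE)  # ~0.003, on 878 df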
[R] arima: warning when fixing MA parameters.
I am puzzled by the warning message in the output below. It appears whether or not I fit the seasonal term (but the precise point of doing this was to fit what is effectively a second seasonal term). Is there some deep reason why AR parameters (Warning message: some AR parameters were fixed: ...) should somehow intrude into the fitting of a model that has only MA terms?

library(DAAG)
attach(bomsoi)
# The following is fine:
arima(avrain, order=c(0,0,4), seasonal=list(order=c(0,0,1), period=12),
      fixed=c(NA,0,0,NA,NA,NA))
. . .
# The following generates a warning message:
arima(avrain, order=c(0,0,4), seasonal=list(order=c(0,0,1), period=12),
      fixed=c(0,0,0,NA,NA,NA))

Call:
arima(x = avrain, order = c(0, 0, 4), seasonal = list(order = c(0, 0, 1),
    period = 12), fixed = c(0, 0, 0, NA, NA, NA))

Coefficients:
      ma1  ma2  ma3     ma4     sma1  intercept
        0    0    0  0.0357  -0.1061   456.6675
s.e.    0    0    0  0.1015   0.0886     7.6997

sigma^2 estimated as 6849:  log likelihood = -595.23,  aic = 1198.46

Warning message:
some AR parameters were fixed: setting transform.pars = FALSE in:
arima(avrain, order = c(0, 0, 4), seasonal = list(order = c(0,

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
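If the warning is, as it appears, spurious for a pure MA model, one possible workaround (a sketch only; whether the warning can safely be ignored is exactly the question at issue) is to set transform.pars explicitly, so that arima() does not need to change it itself and so does not warn. This continues from the attach(bomsoi) above:

arima(avrain, order = c(0, 0, 4),
      seasonal = list(order = c(0, 0, 1), period = 12),
      fixed = c(0, 0, 0, NA, NA, NA), transform.pars = FALSE)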
Re: [R] Multiple expressions, when using substitute()
Yes, I did get a very helpful reply from Marc Schwartz. I have had substitute() working in legend(), when the legend argument has length one. The challenge was to find some way to do the equivalent of substitute() when several expressions appear in parallel, as may be required for legend(). The trick is to use bquote() to do the substitution. The resulting quoted expression (of mode call) can then be an element in a list, along with other quote()d (or bquote()d) expressions. The list elements, when passed to expression() via the args argument of do.call(), become unquoted expressions. Note that bquote() uses a syntax for the substitution of variables that is different from that used by substitute(). It would be useful to include some such example as the one below on the help page for bquote():

library(DAAG)
Acmena <- subset(rainforest, species == "Acmena smithii")
plot(wood ~ dbh, data = Acmena)
Acmena.lm <- lm(log(wood) ~ log(dbh), data = Acmena)
b <- round(coef(Acmena.lm), 3)
arg1 <- bquote(italic(y) == .(A) * italic(x)^.(B), list(A = b[1], B = b[2]))
arg2 <- quote("where " * italic(y) * " = wood; " * italic(x) * " = dbh")
legend("topleft", legend = do.call(expression, c(arg1, arg2)), bty = "n")

John Maindonald.

On 11 Oct 2005, at 11:41 AM, Spencer Graves wrote:

Have you received a reply to this post? I couldn't find one, and I couldn't find a solution, even though one must exist. I can get the substitute to work in main but not legend:

B <- 2:3
eB <- substitute(y == a * x^b, list(a = B[1], b = B[2]))
plot(1:2, 1:2, main = eB)

You should be able to construct it using mtext, but I couldn't get the desired result using legend. Hope this helps. Spencer Graves

John Maindonald wrote:

expression() accepts multiple expressions as arguments, thus:

plot(1:2, 1:2)
legend("topleft", expression(y == a * x^b,
       "where " * paste("y==wood; ", "x==dbh")))

Is there a way to do this when values are to be substituted for a and b? i.e., the first element of the legend argument to legend() becomes, effectively:

substitute(y == a * x^b, list(a = B[1], b = B[2]))

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

-- Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA [EMAIL PROTECTED] www.pdf.com http://www.pdf.com Tel: 408-938-4420 Fax: 408-280-7915

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
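A stripped-down version of the same trick, with made-up coefficients in place of the DAAG example, may make the mechanism easier to see. (A sketch; the numbers in b are hypothetical. Note that do.call() hands the already-quoted calls to expression(), which leaves them unevaluated.)

b <- c(1.2, 0.85)   # hypothetical coefficients
arg1 <- bquote(italic(y) == .(A) * italic(x)^.(B), list(A = b[1], B = b[2]))
arg2 <- quote("where " * italic(y) * " = wood; " * italic(x) * " = dbh")
plot(1:2, 1:2)
legend("topleft", legend = do.call(expression, list(arg1, arg2)), bty = "n")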
Re: [R] R/S-Plus equivalent to Genstat predict
As an alternative to the effects package, try predict() with type="terms". JM

On 7 Oct 2005, at 8:00 PM, Peter Dunn wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Dunn Sent: Wednesday, October 05, 2005 9:06 PM To: R-help mailing list Subject: [R] R/S-Plus equivalent to Genstat predict: predictions over averages of covariates

Hi all, I'm doing some things with a colleague comparing different sorts of models. My colleague has fitted a number of glms in Genstat (which I have never used), while the glm I have been using is only available for R. He has a spreadsheet of fitted means from each of his models, obtained using the Genstat predict function. For example, suppose we fit a model of the type

glm.out <- glm(y ~ factor(F1) + factor(F2) + X1 + poly(X2,2) + poly(X3,2),
               family=...)

Then he produces a table like this (made up, but similar):

F1 (level 1)   12.2
F1 (level 2)   14.2
F1 (level 3)   15.3
F2 (level 1)   10.3
F2 (level 2)    9.1
X1 = 0         10.2
X1 = 0.5       10.4
X1 = 1         10.4
X1 = 1.5       10.5
X1 = 2         10.9
X1 = 2.5       11.9
X1 = 3         11.8
X2 = 0         12.0
X2 = 0.5       12.2
X2 = 1         12.5
X2 = 1.5       12.9
X2 = 2         13.0
X2 = 2.5       13.1
X2 = 3         13.5

Each of the numbers is a predicted mean. So when X1=0, on average we predict an outcome of 10.2. To obtain these figures in Genstat, he uses the Genstat predict function. When I asked for an explanation of how it was done (i.e., to make the predictions, what values of the other covariates were used), I was told: So, for a one-dimensional table of fitted means for any factor (or variate), all other variates are set to their average values; and the factor constants (including the first, at zero) are given a weighted average depending on their respective numbers of observations. So for quantitative variables (such as pH), one uses the mean pH in the data set when making the predictions. Reasonable and easy. But for categorical variables (like Month), he implies we use a weighted average of the fitted coefficients for all the months, depending on the proportion of times those factor levels appear in the data. (I hope I explained that OK...) Is there an equivalent way in R or S-Plus of doing this? I have to do it for a number of sites and species, so an automated way would be useful. I have tried searching to no avail (but may not be searching on the correct terms), and tried hard-coding something myself, as yet unsuccessfully: the poly terms and the use of the weighted averaging over the factor levels are proving a bit too much for my limited skills. Any assistance appreciated. (Any clarification of what I mean can be provided if I have not been clear.) Thanks, as always. P. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
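A sketch of what predict(..., type="terms") returns, on a made-up data frame (the data and model here are hypothetical, with names echoing the post):

set.seed(1)
dat <- data.frame(y  = rpois(100, 5),
                  F1 = factor(sample(1:3, 100, replace = TRUE)),
                  X1 = runif(100))
glm.out <- glm(y ~ F1 + X1, family = poisson, data = dat)
tt <- predict(glm.out, type = "terms")
head(tt)               # per-term contributions, centred over the data
attr(tt, "constant")   # the constant that re-centres the predictions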
[R] Multiple expressions, when using substitute()
expression() accepts multiple expressions as arguments, thus:

plot(1:2, 1:2)
legend("topleft", expression(y == a * x^b,
       "where " * paste("y==wood; ", "x==dbh")))

Is there a way to do this when values are to be substituted for a and b? i.e., the first element of the legend argument to legend() becomes, effectively:

substitute(y == a * x^b, list(a = B[1], b = B[2]))

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-help Digest, Vol 31, Issue 30
With lme4, use of mcmcsamp() can be insightful. (Douglas Bates drew my attention to this function in a private exchange of emails.) The distributions of the random effects are simulated on a log scale, where the distributions are much closer to symmetry than on the scale of the random effects themselves. As far as I can see, this is a straightforward use of MCMC to estimate model parameters; it is not clear to me whether the results from the lmer() fit are used. John Maindonald.

On 30 Sep 2005, at 8:00 PM, [EMAIL PROTECTED] wrote: From: Roel de Jong [EMAIL PROTECTED] Date: 29 September 2005 11:19:38 PM To: r-help@stat.math.ethz.ch Subject: [R] standard error of variances and covariances of the random effects with LME

Hello, how do I obtain standard errors of the variances and covariances of the random effects with LME, comparable to those of, for example, MLwiN? I know you shouldn't use them because the distribution of the estimator isn't symmetric blablabla, but I need a measure of the variance of those estimates for pooling my multiple imputation results. Regards, Roel.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
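A sketch of the kind of call involved, using the lme4 of that era (mcmcsamp() has since been removed from lme4, so this illustrates the 2005-vintage API rather than current practice; sleepstudy is lme4's example dataset, standing in for the poster's data):

library(lme4)
fm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
samp <- mcmcsamp(fm, n = 1000)   # MCMC sample from the posterior of the parameters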
Re: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA
For the record, it turns out that EXPNO ran from 1 to 20, i.e., it identified subject. Thus EXPNO/COND parsed into the two error terms (additional to residual) EXPNO and EXPNO:COND. This second error term accounts for all variation between levels of COND, so there is no COND sum of squares. (In SPSS the fixed effect COND may have taken precedence; I do not know for sure.) In R, if this was a complete randomized design, the term Error(EXPNO), or in the mock-up example I gave Error(subj), would be enough on its own. The R implementation can handle error terms akin to Error(REPNO/subj), but because there are redundant model matrix columns generated by the REPNO:subj term, it complains that the Error() model is singular. In general, terms of the form a/b should be used only if b is nested within a, i.e., REPNO/IndividualWithinBlock (where IndividualWithinBlock runs from 1 to 4), not REPNO/subj. (Either of these causes REPNO to be treated as a blocking factor.)

xy <- expand.grid(REPNO=letters[1:5], COND=letters[1:4],
                  TIME=factor(paste(1:2)))
xy$subj <- factor(paste(xy$REPNO, xy$COND, sep=":"))
## Below, subj becomes EXPNO
xy$COND <- factor(xy$COND)
xy$REPNO <- factor(xy$REPNO)
xy$y <- rnorm(40)

Plea to those who post such questions to the list:
~ Please include either a toy data set or, if the actual data set is small, lists of factor values. If you are happy to make the information public, give the result vector also (this is less important!). Or you can put the data and, where relevant, your code, on a web site. Be careful about the use of the word groups in an experimental design context; speak of treatment groups if that is the meaning, or blocks if that is what is intended. I suspect that confusion between these two contexts in which the word groups is wont to be used lay behind the use of the EXPNO/COND form of model formula. John Maindonald.

On 10 Sep 2005, at 8:00 PM, Larry A Sonna wrote: From: Larry A Sonna [EMAIL PROTECTED] Date: 10 September 2005 12:10:06 AM To: r-help@stat.math.ethz.ch Subject: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA

Dear R community, I am trying to resolve a discrepancy between the way SPSS and R handle 2-way, repeated measures ANOVA. An experiment was performed in which samples were drawn before and after treatment of four groups of subjects (control and disease states 1, 2 and 3). Each group contained five subjects. An experimental measurement was performed on each sample to yield a signal. The before and after treatment signals for each subject were treated as repeated measures. We desire to obtain P values for disease state (CONDITION), and the interaction between signal over time and disease state (CONDITION*TIME). Using SPSS, the following output was obtained:

           DF  SumSq (Type 3)  Mean Sq  F value  P=
COND        3           42861    14287    3.645  0.0355
TIME        1             473      473    0.175  0.681
COND*TIME   3             975      325    0.120  0.947
Error      16           43219     2701

By contrast, using the following R command:

summary(aov(SIGNAL ~ (COND+TIME+COND*TIME) + Error(EXPNO/COND), Type=III))

the output was as follows:

           Df Sum Sq Mean Sq F value  Pr(>F)
COND        3  26516    8839  3.2517 0.03651 *
TIME        1    473     473  0.1739 0.67986
COND:TIME   3    975     325  0.1195 0.94785
Residuals  28  76107    2718

I don't understand why the two results are discrepant. In particular, I'm not sure why R is yielding 28 DF for the residuals whereas SPSS only yields 16. Can anyone help?
John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA
There are 20 distinct individuals, right? expno breaks the 20 individuals into five groups of 4, right? Is this a blocking factor? If expno is treated as a blocking factor, the following is what you get:

xy <- expand.grid(expno=letters[1:5], cond=letters[1:4],
                  time=factor(paste(1:2)))
xy$subj <- factor(paste(xy$expno, xy$cond, sep=":"))
xy$cond <- factor(xy$cond)
xy$expno <- factor(xy$expno)
xy$y <- rnorm(40)
summary(aov(y ~ cond*time + Error(expno/cond), data=xy))

Error: expno
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  4   3.59    0.90

Error: expno:cond
          Df Sum Sq Mean Sq F value Pr(>F)
cond       3   1.06    0.35    0.36   0.78
Residuals 12  11.86    0.99

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
time       1   2.27    2.27    1.38   0.26
cond:time  3   3.27    1.09    0.67   0.59
Residuals 16  26.19    1.64

If on the other hand this is analyzed as for a complete randomized design, the following is the output:

summary(aov(y ~ cond*time + Error(subj), data=xy))

Error: subj
          Df Sum Sq Mean Sq F value Pr(>F)
cond       3   1.06    0.35    0.37   0.78
Residuals 16  15.46    0.97

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
time       1   2.27    2.27    1.38   0.26
cond:time  3   3.27    1.09    0.67   0.59
Residuals 16  26.19    1.64

On 10 Sep 2005, at 8:00 PM, Larry A Sonna wrote: From: Larry A Sonna [EMAIL PROTECTED] Date: 10 September 2005 12:10:06 AM To: r-help@stat.math.ethz.ch Subject: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA

Dear R community, I am trying to resolve a discrepancy between the way SPSS and R handle 2-way, repeated measures ANOVA. An experiment was performed in which samples were drawn before and after treatment of four groups of subjects (control and disease states 1, 2 and 3). Each group contained five subjects. An experimental measurement was performed on each sample to yield a signal. The before and after treatment signals for each subject were treated as repeated measures. We desire to obtain P values for disease state (CONDITION), and the interaction between signal over time and disease state (CONDITION*TIME). Using SPSS, the following output was obtained:

           DF  SumSq (Type 3)  Mean Sq  F value  P=
COND        3           42861    14287    3.645  0.0355
TIME        1             473      473    0.175  0.681
COND*TIME   3             975      325    0.120  0.947
Error      16           43219     2701

By contrast, using the following R command:

summary(aov(SIGNAL ~ (COND+TIME+COND*TIME) + Error(EXPNO/COND), Type=III))

the output was as follows:

           Df Sum Sq Mean Sq F value  Pr(>F)
COND        3  26516    8839  3.2517 0.03651 *
TIME        1    473     473  0.1739 0.67986
COND:TIME   3    975     325  0.1195 0.94785
Residuals  28  76107    2718

I don't understand why the two results are discrepant. In particular, I'm not sure why R is yielding 28 DF for the residuals whereas SPSS only yields 16. Can anyone help?

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] colors and palettes and things...
The DAAG package has the following:

show.colors(type=c("singles", "shades", "grayshades"), order.cols=TRUE)

I am sure there are better ways to do the ordering than my ad hoc approach, though.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 24 May 2005, at 11:04 AM, [EMAIL PROTECTED] wrote: From: Jeff D. Hamann [EMAIL PROTECTED] Date: 23 May 2005 4:35:27 PM To: r-help@stat.math.ethz.ch Subject: [R] colors and palettes and things... Reply-To: [EMAIL PROTECTED]

After trying to find if there was a color picker in the FAQs and the help, I thought I would send a post here. I was overwhelmed with all the wonderful color choices R has predefined (discovered after typing in colors()) but can't figure out what they all (by name) look like. Is there a color picker or some other method to display all those colors next to the name? I think I can put together palettes, but another question I have then regards the building of palettes (a list of variable length I can select or create myself, other than the ones defined by palette()) so I can pass these colors into functions, instead of having to predefine a bunch of colors myself or use the predefined colors like terrain.colors(n). Are there groups of colors in colors() that I can group together to make some nice palettes for drawing barplots, etc? Thanks, Jeff. -- Jeff D. Hamann Forest Informatics, Inc. PO Box 1421 Corvallis, Oregon 97339-1421 phone 541-754-1428 fax 541-752-0288 [EMAIL PROTECTED] www.forestinformatics.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
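For anyone without DAAG to hand, a bare-bones chart of the named colors can be had with base graphics alone. A sketch only: colors are drawn in the order colors() returns them, with no attempt at the ordering discussed above, and no names shown:

cols <- colors()
n <- length(cols)
nr <- 25
nc <- ceiling(n / nr)
plot(c(0, nc), c(0, nr), type = "n", axes = FALSE,
     xlab = "", ylab = "", main = "colors(), unordered")
for (i in seq_along(cols)) {
  x <- (i - 1) %/% nr   # column in the chart
  y <- (i - 1) %% nr    # row in the chart
  rect(x, y, x + 0.9, y + 0.9, col = cols[i], border = NA)
}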
[R] Garbled plot label
Marc Schwartz (MSchwartz at MedAnalytics.com) wrote: Here is another option. One issue is that, given the way in which plot.lm() is coded, some of the axis labels are hard coded when passed to the four underlying plot functions, and as far as I can tell there is no way to use a 'par' and just blank out the x axis labels only. Thus, both x and y axis labels need to be blanked and then separately created using title(). Thus, here is a plot.lm2() function. It's a bit kludgy, but it seems to work, though other eyes should look at it for any errors. What it effectively does is to do each of the four plots in plot.lm() individually without labels (ann = FALSE) and then adds them, generally based upon the way it is done in plot.lm(). The x axis labels are paste()'d to the wrapped model expression to create a multi-line sub-title for each plot. The wrap.len argument is the 'width' argument for strwrap, indicating the target line-wrapping length. Note that if you get to around 3 lines, you will likely need to modify the margins in the plot for side 1 to provide more room.

I cannot see how to set this up so that it works in every situation. The only clean and simple way that I can see to handle the problem is to set a default that tests whether the formula is broken into multiple text elements and, if it is, omits it. Users can then use their own imaginative skills, and such suggestions as have been made on r-help, to construct whatever form of labeling best suits their case, their imaginative skills and their coding skills. The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ has image files plot.lm.RData and plot6.lm.RData that are proposals for a revised version of plot.lm(). The changes made so far do not deal with the long-formula issue, but if nothing better turns up my proposal will be to proceed as I have just indicated.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
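The wrapping device that Marc describes can be tried in isolation (a sketch; the formula and the width of 30 are arbitrary choices, not taken from plot.lm2()):

f <- log(wood) ~ log(dbh) + species + site + aspect + slope
strwrap(paste(deparse(f), collapse = " "), width = 30)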
[R] Odd diagnostic plots in mixed-effects models
Is the fixed effect estimated at the innermost level? If not, plots of residuals at that level are surely of limited interest. qq-plots, to be relevant, surely need to assess normality of effects (rather than residuals) at the level that matters for the intended inferences. If the fixed effect is estimated at the level of the random effect, then of course there are just 12 effects that should appear in any qq or suchlike plot.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 19 Apr 2005, at 8:03 PM, [EMAIL PROTECTED] wrote: From: Andrew Robinson [EMAIL PROTECTED] Date: 19 April 2005 12:41:24 PM To: r-help@stat.math.ethz.ch Subject: [R] Odd diagnostic plots in mixed-effects models

Dear R community, In the excellent nlme package the default diagnostic plot graphs the innermost residuals against innermost fitted values. I recently fit a mixed-effects model in which there was a very clear positive linear trend in this plot. I inferred that this trend occurred because my fixed effect was a two-level factor, and my random effect was a 12-level factor. The negative residuals were associated with negative random effects (because of shrinkage, I assume), and the positive with positive. The fixed effects explained little variation. Therefore plotting the innermost residuals against the innermost fitted values had the negative residuals to the left and the positive residuals to the right, occasioning a trend. My questions are: is it (as I suspect) harmless, or does it suggest that the model is lacking? And is this effect likely to compromise the interpretation of any of the other standard diagnostic plots (e.g. qqnorm)? Thanks much for any thoughts, Andrew -- Andrew Robinson Ph: 208 885 7115 Department of Forest Resources Fa: 208 885 6226 University of Idaho E : [EMAIL PROTECTED] PO Box 441133 W : http://www.uidaho.edu/~andrewr Moscow ID 83843 Or: http://www.biometrics.uidaho.edu __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
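One way to look directly at the estimated effects at the level that matters is to extract them with ranef() and qq-plot those, rather than the innermost residuals. A sketch with nlme's Orthodont data, which is not from this thread but is a convenient grouped dataset:

library(nlme)
fm <- lme(distance ~ age, random = ~1 | Subject, data = Orthodont)
re <- ranef(fm)[, 1]    # estimated random intercepts, one per subject
qqnorm(re); qqline(re)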
[R] Base and lattice graphics on the same graphics page
Although base graphics does not mix with lattice in the one graph, I've found that print.trellis(position=...) and the use of par(fig=...) to put regular and trellis graphics on the one graphics page works like a treat, at least in version 2.0.1 of R. [Base graphics functions that are themselves inconsistent with par(fig=...) are obviously disallowed.] I am wondering whether there are caveats of which I and others should be aware, or whether there is a risk that the ongoing development of R's graphics abilities will render such a cohabitation unworkably fractious. Example:

gph <- bwplot(voice.part ~ height, data=singer)
print(gph, position=c(0, 0.5, 1, 1))  # x0, y0, x1, y1
par(fig=c(0, 1, 0, 0.5), new=TRUE)    # x0, x1, y0, y1
boxplot(height ~ voice.part, data=singer, horiz=TRUE)

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 24, Issue 28
You've omitted a comma. races2000 is a data frame, which for purposes of extracting rows behaves like a 2-dimensional object. The following works fine:

hills2000 <- races2000[races2000$type == "hill", ]

Additionally, you might like to ponder:

type <- races2000[names(races2000) == "type"]
type[1:4]
Error in "[.data.frame"(type, 1:4) : undefined columns selected
length(type)    # type is a data frame with 1 column
[1] 1
vtype <- unlist(type)   # Extract the vector that is the one
                        # data frame (list) element
vtype[1:4]              # Try also length(vtype)
type.type1 type.type2 type.type3 type.type4
    uphill      other      other      relay

Your syntax (without the comma) does give a result, providing that the dimensions match (the condition must have the same number of elements as races2000 has columns), but it is probably not the result that you want! See further pp. 320-321 of the DAAG book.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 28 Feb 2005, at 10:07 PM, [EMAIL PROTECTED] wrote: From: Clint Harshaw [EMAIL PROTECTED] Date: 28 February 2005 1:08:36 AM To: r-help@stat.math.ethz.ch Subject: [R] subsetting data set dimension problem (See DAAG book, p. 173, ex. 3)

I'm a new user of R, and I'm following the DAAG text. I want to create a subset of the races2000 data frame, but get errors because of a mismatch of values in some columns:

library(DAAG)
attach(races2000)
hills2000 <- races2000[races2000$type == 'hill']
Error in as.matrix.data.frame(x) : dim<- : dims [product 770] do not match the length of object [771]

However, if I follow the solution given, and remove redundant columns 1 through 6 and column 11 (which I won't need, since I know they are going to have the same value), I don't get the error:

hills2000 <- races2000[races2000$type == 'hill', -c(1:6,11)]
hills2000
              dist climb  time     timef
Tiso Carnethy 6.00  2500 0.782 0.9191667
[...]
Cornalees     5.50   800 0.618        NA
[...]

What is causing the error with my original subsetting? I speculated it was related to the NA values, but there is an NA in the resulting hills2000, corresponding to the Cornalees hill race. Thanks, Clint -- Clint Harshaw, PhD Department of Mathematics Presbyterian College Clinton SC 29325 EMail: [EMAIL PROTECTED] Phone: 864.833.8995 Fax: 864.938.3769 Office: Harrington-Peachtree Rm 412

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
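The distinction in miniature, with a toy data frame (hypothetical data, not from the DAAG package):

df <- data.frame(a = 1:3, b = letters[1:3], cc = 4:6)
df[c(TRUE, FALSE, TRUE), ]   # rows 1 and 3: the comma makes this row selection
df[c(TRUE, FALSE, TRUE)]     # columns a and cc: without the comma, the logical
                             # vector indexes the columns (list elements)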
[R] read.table
In addition to other suggestions made, note also count.fields().

cat("10 9 17",    # First of 7 lines
    "11 13 1 6", "9 14 16",
    "12 15 14", "8 15 15", "9 13 12", "7 14 18",
    file="oneBadRow.txt", sep="\n")
nfields <- count.fields("oneBadRow.txt")
nfields
[1] 3 4 3 3 3 3 3
table(nfields)   ## Use with many records
nfields
3 4
6 1
tab <- table(nfields)
(1:length(nfields))[nfields == 4]
[1] 2
readLines("oneBadRow.txt", n=-1)[2]
[1] "11 13 1 6"

Note the various option settings for count.fields().

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 26 Feb 2005, at 10:03 PM, [EMAIL PROTECTED] wrote: From: Sean Davis [EMAIL PROTECTED] Date: 26 February 2005 7:11:48 AM To: r-help r-help@stat.math.ethz.ch Subject: [R] read.table

I have a commonly recurring problem and wondered if folks would share tips. I routinely get tab-delimited text files that I need to read in. In very many cases, I get:

a <- read.table('junk.txt.txt', header=T, skip=10, sep="\t")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
        line 67 did not have 88 elements

I am typically able to go through the file and find a single quote or something like that causing the problem, but with a recent set of files, I haven't been able to find such an issue. What can I do to get around this problem? I can use perl, also. Thanks, Sean __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 24, Issue 22
You need to give the model formula that gave your output. There are two sources of variation (at least), within and between locations; though it looks as though your analysis may have tried to account for this (but if so, the terms are not laid out in a way that makes for ready interpretation). The design is such (two locations) that you do not have much of a check that effects are consistent over locations. You need to check whether results really are similar for all cultivars and for all herbicides, so that it is legitimate to pool as happens in the overall analysis. If a herbicide:cultivar combination has little effect, the variability may be large, while if it has a dramatic effect (kills everything!), there may be no variability to speak of. John Maindonald.

On 22 Feb 2005, at 10:06 PM, [EMAIL PROTECTED] wrote: To: 'Bob Wheeler' [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: RE: [R] power.anova.test for interaction effects Reply-To: [EMAIL PROTECTED]

It's a rather complex model. A 37*4 factorial (37 cultivars [var]; 4 herbicide treatments [trt]) with three replications [rep] was carried out at two locations [loc], with different randomizations within each rep at each location.

Source         DF  Error Term        MS
Loc             1  Trt*rep(loc)   12314
Rep(loc)        4  Trt*rep(loc)  1230.5
Trt             3  Trt*rep(loc)   64.72
Trt*loc         3  Trt*rep(loc)   33.42
Trt*rep(loc)   12  Residual       76.78
Var            36  Var*trt*loc    93.91
Var*trt       108  Var*trt*loc    12.06
Var*trt*loc   144  Residual       43.09
Residual      575  NA             21.23

-----Original Message----- From: Bob Wheeler [mailto:[EMAIL PROTECTED] Sent: Monday, February 21, 2005 4:33 PM To: [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: Re: [R] power.anova.test for interaction effects

Your F value is so low as to make me suspect your model. Where did the 144 denominator degrees of freedom come from?

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] is.matrix(), as.matrix, as(,matrix)
Under help(matrix) it is written:

'is.matrix' tests if its argument is a (strict) matrix. It is generic: you can write methods to handle specific classes of objects, see InternalMethods.

Further down, under Details, the meaning of strict is explained more explicitly:

'is.matrix' returns 'TRUE' if 'x' is a matrix (i.e., it is _not_ a 'data.frame' and has a 'dim' attribute of length 2) and 'FALSE' otherwise

(i.e., strict means matrix in a broad sense).

# The following is consistent with this:
tab <- with(ToothGrowth, table(supp, dose))
is.matrix(tab)
[1] TRUE
class(as.matrix(tab))
[1] "table"

However the function as() has an argument strict that has the connotation restricted sense:

strict: A logical flag. If 'TRUE', the returned object must be strictly from the target class (unless that class is a virtual class, in which case the object will be from the closest actual class (often the original object, if that class extends the virtual class directly).

# The following is consistent with this:
class(as(tab, "matrix", strict=TRUE))   # TRUE is the default
[1] "matrix"
class(as(tab, "matrix", strict=FALSE))
[1] "table"

# Note also:
class(data.matrix(tab))
[1] "table"

At the very least, the word (strict) should be removed from "'is.matrix' tests if its argument is a (strict) matrix." and replaced by something like "(in the broad sense defined below)". It would make for greater consistency and convenience if all of as.matrix(), is.matrix() and data.matrix() had arguments strict, where strict=FALSE would preserve the present behaviour. Additionally, it would be useful to mention, under the documentation for matrix() and data.matrix(), that as.matrix(tab) is equivalent to as(tab, "matrix", strict=FALSE) (is that strictly correct?). I often want to use xtable() with tables. There is no method defined for the class table. After a bit of rummaging, I found that I can use xtable(as(tab, "matrix")).

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] classification for huge datasets: SVM yields memory troubles
While it is true that the large number of variables relative to the number of observations restricts what can be inferred, the situation is not as hopeless as Bert seems to suggest. If it were, attempts at the analysis of expression array data would be a waste of time. Methods developed for that general area may well be relevant to other data where the number of variables is similarly far larger than the number of observations. See Ambroise, C. and McLachlan, G.J. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99: 6562--6566. This discusses some of the literature on the use of SVMs. The selection bias that these authors discuss also affects plots, even principal components and other ordination-based plots, where features have been selected on the basis of their ability to separate into known groups. I have draft versions of code that address this selection bias as it affects the plotting of graphs, which (along with a paper that has been submitted for inclusion in a conference proceedings) I am happy to make available to anyone who wants to experiment. Another good place to look, as a starting point, may be Gordon Smyth's LIMMA User's Guide. This can be a bit hard to find. With limma installed, type help.start(). After some time a browser window should open. Click on Packages | limma | Overview | LIMMA User's Guide (pdf).

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 14 Dec 2004, at 10:09 PM, [EMAIL PROTECTED] wrote: From: Berton Gunter [EMAIL PROTECTED] Date: 14 December 2004 9:23:08 AM To: 'Andreas' [EMAIL PROTECTED], [EMAIL PROTECTED] Cc: Subject: RE: [R] classification for huge datasets: SVM yields memory troubles

"I have a matrix with 30 observations and roughly 3 variables, ..." snipped

Comment: This is ** not ** a huge data set -- it is a tiny one with a large number of covariates. The difference is: If it were truly huge, SVM and/or LDA or ... might actually be able to produce useful results. With so few data and so many variables, it is hard to see how any approach that one uses is not simply a fancy random number generator.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Call to trellis.focus(); thenpanel.superpose()
The following works fine with the x11 device, though it may well be that an initial plot is overwritten. With a pdf or postscript device, I get two plots, the first of which still has the red border from having the focus, while the second is the plot that I want.

library(lattice); library(grid)
plt <- xyplot(uptake ~ conc, groups=Plant, data=CO2)
print(plt)
trellis.focus("panel", row=1, column=1)
arglist <- trellis.panelArgs()
arglist$type <- "l"
do.call("panel.superpose", args=arglist)
trellis.unfocus()

Should I be able to use panel.superpose() in this way? The new abilities provided by trellis.focus() etc. add greatly to the flexibility of what can be done with lattice plots. The grid-lattice combination is a great piece of software.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Location of grobs etc on lattice output
I'm puzzled about side effects of trellis.unfocus(): The following runs without problem, though grid.text() does not seem to do anything. (I'd thought that I had it working at one point.)

library(DAAG); library(lattice); library(grid)
cuckoos.strip <- stripplot(species ~ length, xlab="", data=cuckoos)
cuckoos.bw <- bwplot(species ~ length, xlab="Length of egg (mm)",
                     data=cuckoos)
vp0 <- viewport(layout=grid.layout(2, 1))
pushViewport(vp0)
vp1 <- viewport(layout.pos.row=1)
vp2 <- viewport(layout.pos.row=2)
pushViewport(vp1)
print(cuckoos.strip, newpage=FALSE)
# trellis.focus("panel", row=1, column=1, clip.off=TRUE)
grid.text("A", x=unit(0, "native"), y=unit(1.05, "native"),
          gp=gpar(fontsize=9))
# trellis.unfocus()   ## remove the following upViewport()
upViewport()
upViewport()
pushViewport(vp2)
print(cuckoos.bw, newpage=FALSE)
trellis.focus("panel", row=1, column=1, clip.off=TRUE)
grid.text("B", x=unit(0, "native"), y=unit(1.05, "native"),
          gp=gpar(fontsize=9))
trellis.unfocus()

If I remove the #'s, and remove the upViewport() that follows the second #, I seem to lose the current tree, as though the newpage=FALSE for the next print() is ignored. Should I be able to do something like this? Clearly I do not understand what happens when trellis.focus() is invoked. This seems an area where an effective GUI, with a graphical display of the viewport tree, could be very helpful.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 21 Nov 2004, at 3:41 PM, Deepayan Sarkar wrote:

On Saturday 20 November 2004 19:41, John Maindonald wrote: Is there any way, after use of print.trellis(), to obtain the co-ordinates of the plot region, e.g., in what are then the native co-ordinates?

Have you read help(trellis.focus)? This is new in 2.0.0 and the recommended API for interacting with lattice plots (you can of course use grid tools directly, but details are more likely to change at that level). It hasn't had much testing, so I would appreciate reports of things that should be doable easily but aren't. e.g.

library(DAAG)
library(lattice); library(grid)
data(cuckoos)
pushViewport(viewport(layout=grid.layout(2, 1)))
pushViewport(viewport(layout.pos.row=1))
cuckoos.strip <- stripplot(species ~ length, data=cuckoos)
print(cuckoos.strip, newpage=FALSE)
grid.text("A", x=unit(0.18, "native"), y=unit(0.925, "native"))
# This works, but is fiddly, and needs rejigging if width
# or fontsize are changed.
popViewport(1)

An alternative would of course be to access the co-ordinate system used by the lattice function for locating the panels, or for locating labelling. As in the example above, I have been using grid.text() to position text outside the plot region, but closer to the top axis than the legend parameter to the lattice function will allow.

trellis.focus("panel", row=1, column=1, clip.off=TRUE) will put you in the plot region (panel), but will switch off clipping so you can write text outside. You can also now control the amount of space between the axis and legend; see

str(trellis.par.get("layout.heights"))

Deepayan __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Location of grobs etc on lattice output
Is there any way, after use of print.trellis(), to obtain the co-ordinates of the plot region, e.g., in what are then the native co-ordinates? e.g.

library(DAAG)
library(lattice); library(grid)
data(cuckoos)
pushViewport(viewport(layout=grid.layout(2, 1)))
pushViewport(viewport(layout.pos.row=1))
cuckoos.strip <- stripplot(species ~ length, data=cuckoos)
print(cuckoos.strip, newpage=FALSE)
grid.text("A", x=unit(0.18, "native"), y=unit(0.925, "native"))
# This works, but is fiddly, and needs rejigging if width
# or fontsize are changed.
popViewport(1)

An alternative would of course be to access the co-ordinate system used by the lattice function for locating the panels, or for locating labelling. As in the example above, I have been using grid.text() to position text outside the plot region, but closer to the top axis than the legend parameter to the lattice function will allow.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] The hidden costs of GPL software?
The author of the article says nothing about the large number of hours and weeks that he surely spent learning S-PLUS! There should be attention to the costs that arise from a wrong or inappropriate analysis, perhaps because the software that is in use makes it difficult to do anything better, perhaps because of statistical skill limitations, often with these two factors working together. Analyses that misrepresent the science, or designs and analyses that conspire together to this end, have serious and costly implications for research. I've refereed several papers recently, in broadly ecological fields of endeavour, with seemingly quite reasonable data, where the mix of author skill and abilities of the package was clearly not up to the task in hand. Relative to getting on top of the statistical issues (for which they will probably end up getting, as they need to, statistical help), the GUI/no-GUI issue will be a minor consideration, and hours or weeks spent learning R will be at most a modest consideration.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 17 Nov 2004, at 10:27 PM, [EMAIL PROTECTED] wrote: From: Philippe Grosjean [EMAIL PROTECTED] Date: 17 November 2004 8:53:28 PM To: [EMAIL PROTECTED], [EMAIL PROTECTED] Cc: Subject:

Hello, In the latest 'Scientific Computing World' magazine (issue 78, p. 22), there is a review of free statistical software by Felix Grant ("doesn't have to pay good money to obtain good statistics software"). As far as I know, this is the first time that R is even mentioned in this magazine, given that it usually discusses commercial products. In this article, the analysis of R is interesting. It is admitted that R is a great software with lots of potential, but: "All in all, R was a good lesson in the price that may have to be paid for free software: I spent many hours relearning some quite basic things taken for granted in the commercial package." Those basic things are related to data import, obtaining basic plots, etc., with the claim that a more intuitive GUI is missing that would smooth the learning curve a little. There are several R GUI projects ongoing, but these are progressing very slowly. The main reason is, I believe, that a relatively low number of programmers working on R are interested in this field. Most people wanting such a GUI are basic users that do not (cannot) contribute... And if they eventually become more knowledgeable, they tend to have other interests. So, is this analysis correct: are there hidden costs for free software like R in the time required to learn it? At least currently, for the people I know (biologists, ecologists, oceanographers, ...), this is perfectly true. This is even an insurmountable barrier for many of them I know, and they have given up (they come back to Statistica, Systat, or S-PLUS, using exclusively functions they can reach through menus/dialog boxes). Of course, the solution is to have a decent GUI for R, but this is a lot of work, and I wonder if the intrinsic mechanism of the GPL is not working against such a development (leading to a very small pool of programmers actively involved in the elaboration of such a GUI, in comparison to the very large pool of competent developers working on R itself).
Do not misunderstand me: I am not giving up on my GUI project. I am just wondering if there is a general, ineluctable mechanism that leads to the current R / R GUI situation as it stands... and consequently to a general rule that there are indeed, most of the time, hidden costs in free software, due to the larger time required to learn it. I am sure there are counter-examples; however, my feeling is that, for Linux, Apache, etc., the GUI (if there is one) often lags well behind the potential of the software, leading to a steep learning curve for anyone wanting to use all of its features. I would be interested in your impressions and ideas on this topic. Best regards, Philippe Grosjean __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] highly biased PCA data?
I'd suggest you start by using lda() or qda() from MASS, benefits being that (a) if the frequencies in the sample do not reflect the frequencies in the target population, you can set 'prior' to mirror the target frequencies. The issue is, perhaps: is your odd person odd in a population of 1000 dog owners : 100 cat owners : 10 fish owners, or odd, e.g., in a 1000:1000:50 population? You can also vary the prior to see what the effect is. If however you set a large prior probability for a group that is poorly represented, results will be 'noisy'. Note the use of 'classwt' for the prior probabilities in randomForest(). (b) You can plot second versus first discriminant function scores, to get a direct graphical representation of results. Other discrimination techniques may have to use an ordination technique, or even lda() or qda() on a 2-dimensional representation of results, in order to get a scatterplot. [cf MDSplot() for randomForest()]

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 5 Nov 2004, at 10:18 PM, [EMAIL PROTECTED] wrote: From: Berton Gunter [EMAIL PROTECTED] Date: 5 November 2004 5:08:38 AM To: 'Dan Bolser' [EMAIL PROTECTED], 'R-help' [EMAIL PROTECTED] Cc: Subject: RE: [R] highly biased PCA data?

Dan: 1) There is no guarantee that PCA will show separate groups, of course, as that is not its purpose, although it is frequently a side effect. 2) If you were to use a classification method of some sort (discriminant analysis, neural nets, SVMs, model-based classification, ...), my understanding is that yes, indeed, severely unbalanced group membership would affect results. A guess is that Bayesian or other methods that could explicitly model the prior membership probabilities would do better. To make it clear why, suppose that there was a 99.9% preference for dog and .05% each for the others. Then your datasets would have almost no information on how covariates could distinguish the classes, and the best classifier would be to call everything a dog no matter what values the covariates had. I presume experts will have more and better to say about this. -- Bert Gunter

[mailto:[EMAIL PROTECTED] On Behalf Of Dan Bolser Sent: Thursday, November 04, 2004 9:41 AM To: R mailing list Subject: [R] highly biased PCA data?

Hello, supposing that I have two or three clear categories for my data, let's say pet preference across fish, cat, dog. Let's say most people rate their preference as being mostly one of the categories. I want to do PCA on the data to see three 'groups' of people: one group for fish, one for cat and one for dog. I would like to see the odd person who likes both or all three in the (appropriate) middle of the other main groups. Will my data be affected by the fact that I have interviewed 1000 dog owners, 100 cat owners and 10 fish owners? (assuming that each scale of preference has an equal range). Cheers, Dan. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
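A sketch of point (a), using lda() from MASS with an explicit prior (iris stands in here for the pet-preference data; the prior values are arbitrary):

library(MASS)
fit <- lda(Species ~ ., data = iris, prior = c(0.5, 0.25, 0.25))
plot(fit)   # scores on the first two discriminant functions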
[R] Using R for Data Analysis and Graphics ...
A revised version of this document, now designed to reflect version 2.0.0 of R, is available from CRAN sites, under Documentation | Contributed. Data sets are available from my web page directory: http://wwwmaths.anu.edu.au/~johnm/r/dsets Preferably, retrieve the image file usingR.Rdata. The output has, mostly, not been revised; thus it will in some cases reflect an earlier version of R. This is a task for some later time.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Lattice .ps graphic is rotated in LaTeX slides
On 2 Oct 2004, at 8:04 PM, [EMAIL PROTECTED] wrote:

On Fri, 2004-10-01 at 10:58, Peter Dalgaard wrote: Michael Friendly [EMAIL PROTECTED] writes:

! Package graphics Error: Division by 0.

What am I doing wrong, or how could I do it differently so it would work?

You might try \usepackage{graphicx} instead. I seem to recall (vaguely) getting better results with that sometimes.

That should be part of the preamble for using 'seminar', if it is set up properly. There is a decent tutorial for using seminar at: http://astronomy.sussex.ac.uk/~eddie/soft/tutorial.html There is also a great reference for including graphics in LaTeX: www.ctan.org/tex-archive/info/epslatex.pdf FWIW, though I have been using seminar for such presentations, I have recently been looking at the Beamer package: http://latex-beamer.sourceforge.net/ and of course there is also the Prosper package: http://prosper.sourceforge.net/ The one advantage of the Beamer package, for those that require it, is that it supports pdflatex, which the others do not. Though it can be used with dvips/latex + ps2pdf, where needed. HTH, Marc

Note also the package pdfscreen, for use with pdflatex: www.river-valley.com/download2.shtml This pretty much uses regular LaTeX, with the page dimensions changed and the font attributes redefined to make them larger than usual inside the slide environment. Be sure to load the packages xspace and colortbl as well as pdfscreen.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re: Thanks Frank, setting graph parameters, and why social scientists don't use R
There are answers that could and should be applied in specific situations. At least in academia and in substantial research teams, statisticians ought to have a prominent part in many of the research teams; senior statisticians should have a prominent role in deciding the teams to which this applies. Why should it be OK to combine high levels of chemical expertise with truly appalling statistical misunderstandings, to the extent that the supposed chemical insights are not what they appear to be? There should be a major focus on training application-area students to understand important ideas, to recognize when they are out of their depth, and to work with statisticians. There should be much more use of statisticians in the refereeing of published papers. Editors need to seek advice from experienced statisticians (some do) on what sorts of papers are candidates for statistical refereeing. Publication in an archive of the data that have been used for a paper could be a huge help, so that others can check whether the data really do support the conclusions. Even better, as Robert Gentleman has argued, would/will be papers that can be processed through Sweave or its equivalent. Really enlightened people (in the statistical sense) in the applied communities will latch onto R, as some are doing, because the limitations inherent in much other software so often lead to crippled and/or misleading analyses. Increasingly, we can hope that it will become difficult for statistics in various applied-area communities to proceed on its merry way, ignorant of or ignoring most of what has happened in the mainstream statistical community in the past 20 years. The statistical community needs to be a lot more aggressive in demanding adequate standards of data analysis in applied areas, at the same time suggesting ways in which it can work with application-area people to improve standards. It is also fair to comment that the situation is very uneven. There are some areas where the standards are pretty reasonable, at least for the types of problems that typically come up in those areas. John Maindonald.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 18 Aug 2004, Bert Gunter wrote: So we see fairly frequently indications of misunderstanding and confusion in using R. But the problem isn't R -- it's that users don't know enough statistics. . . . . I wish I could say I had an answer for this, but I don't have a clue. I do not think it's fair to expect a mechanical engineer or psychologist or biologist to have the numerous math and statistical courses and the experience in their training that would provide the base they need. For one thing, they don't have the time in their studies for this; for another, they may not have the background or interest -- they are, after all, mechanical engineers or biologists, not statisticians. Unfortunately, they could do their jobs as engineers and scientists a lot better if they did know more statistics. To me, it's a fundamental conundrum, and no one is to blame. It's just the reality, but it is the source of all kinds of frustrations on both sides of the statistical divide, which both you and Roger expressed in your own ways. . . . . __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 18, Issue 12
The message for aov1 was "Estimated effects may be unbalanced". The effects are not unbalanced. The design is 'orthogonal'. The problem is that there are not enough degrees of freedom to estimate all those error terms. If you change the model to:

aov1 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/(fact1+fact2+fact3)), data=myData)

or to

aov2 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/((fact1+fact2+fact3)^2)), data=myData)

all is well. This last model (aov2) seems to me to have an excessive number of error terms. The lme model

lme(RT ~ fact1*fact2*fact3, random = ~1 | sub, data=myData)

is equivalent to

aov0 <- aov(RT ~ fact1*fact2*fact3 + Error(sub), data=myData)

It can be verified that the estimated variance components can be used to reproduce the mean squares in the anova table.

A conservative approach is to estimate, e.g., contrasts involving fact1 for each subject separately, and then compare these against SE estimates that have 4 df (5 subjects - 1); a code sketch follows at the end of this message. This is the ultimate check -- are claimed effects consistent across the 5 subjects? The standard errors should, though, probably be based on some variety of averaged variance. I do not know how to set up the equivalents of aov1 and aov2 (where the errors are indeed crossed) in lme4.

John Maindonald.

On 12 Aug 2004, at 8:03 PM, [EMAIL PROTECTED] wrote:

I will follow the suggestion of John Maindonald and present the problem by example with some data. I also follow the advice to use mean scores, somewhat reluctantly though. I know it is common practice in psychology, but wouldn't it be more elegant if one could use all the data points in an analysis? The data for 5 subjects (myData) are provided at the bottom of this message. It is a crossed within-subject design with three factors and reaction time (RT) as the dependent variable. An initial repeated-measures model would be:

aov1 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/(fact1*fact2*fact3)), data=myData)

aov complains that the effects involving fact3 are unbalanced:

> aov1
Stratum 4: sub:fact3
Terms:
                      fact3   Residuals
Sum of Squares  4.81639e-07 5.08419e-08
Deg. of Freedom           2           8
Residual standard error: 7.971972e-05
6 out of 8 effects not estimable
Estimated effects may be unbalanced

Presumably this is because fact3 has three levels and the other ones only two, making this a non-orthogonal design. I then fit an equivalent mixed-effects model:

lme1 <- lme(RT ~ fact1*fact2*fact3, data=meanData2, random = ~1 | sub)

Subsequently I test if my factors had any effect on RT:

> anova(lme1)
                  numDF denDF   F-value p-value
(Intercept)           1    44 105294024  <.0001
fact1                 1    44        22  <.0001
fact2                 1    44         7  0.0090
fact3                 2    44        19  <.0001
fact1:fact2           1    44         9  0.0047
fact1:fact3           2    44         1  0.4436
fact2:fact3           2    44         1  0.2458
fact1:fact2:fact3     2    44         0  0.6660

Firstly, why are the F-values in the output whole numbers? The effects are estimated over the whole range of the dataset, and this is so because all three factors are nested under subjects, on the same level. This, however, seems to inflate the F-values compared to the anova(aov1) output, e.g.

> anova(aov1)
Error: sub:fact2
          Df     Sum Sq    Mean Sq F value Pr(>F)
fact2      1 9.2289e-08 9.2289e-08  2.2524 0.2078
Residuals  4 1.6390e-07 4.0974e-08

I realize that the (probably faulty) aov model may not be directly compared to the lme model, but my concern is whether the lme estimation of the effects is right, and if so, how can a naive skeptic be convinced of this? The suggestion to use simulate.lme() to find this out seems good, but I can't seem to get it working (from [R] lme: simulate.lme in R it seems it may never work in R).
I have also followed the suggestion to fit the exact same model with lme4. However, the format of the anova output does not give me the estimation in the way nlme does. More importantly, the degrees of freedom in the denominator don't change; they're still large:

library(lme4)
lme4_1 <- lme(RT ~ fact1*fact2*fact3, random = ~1 | sub, data=myData)

> anova(lme4_1)
Analysis of Variance Table
                     Df    Sum Sq   Mean Sq Denom F value    Pr(>F)
fact1I                1 2.709e-07 2.709e-07    48 21.9205 2.360e-05 ***
fact2I                1 9.229e-08 9.229e-08    48  7.4665  0.008772 **
fact3L                1 4.906e-08 4.906e-08    48  3.9691  0.052047 .
fact3M                1 4.326e-07 4.326e-07    48 34.9972 3.370e-07 ***
fact1I:fact2I         1 1.095e-07 1.095e-07    48  8.8619  0.004552 **
fact1I:fact3L         1 8.988e-10 8.988e-10    48  0.0727  0.788577
fact1I:fact3M         1 1.957e-08 1.957e-08    48  1.5834  0.214351
fact2I:fact3L         1 3.741e-09 3.741e-09    48  0.3027  0.584749
fact2I:fact3M         1 3.207e-08 3.207e-08    48  2.5949  0.113767
fact1I:fact2I:fact3L  1 2.785e-09 2.785e-09
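A minimal sketch of the conservative per-subject check described above. It assumes myData has columns RT, sub and fact1, and treats fact1 as two-level; that coding is an assumption made purely for illustration, not taken from the thread.

## One fact1 contrast per subject, from the subject-by-fact1 cell means
cellMeans <- with(myData, tapply(RT, list(sub, fact1), mean))
d <- cellMeans[, 2] - cellMeans[, 1]
## Test the mean contrast against an SE with 4 df (5 subjects - 1)
t.test(d)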
Fwd: [R] Enduring LME confusion or Psychologists and Mixed-Effects
In my understanding of the problem, the model

lme1 <- lme(resp ~ fact1*fact2, random = ~1 | subj)

should be OK, providing that variances are homogeneous both between and within subjects. The function will sort out which factors and interactions are to be compared within subjects, and which between subjects. The problem with df's arises (for lme() in nlme, but not in lme4) when random effects are crossed, I believe. It is difficult to give a general rule on the effect of imbalance; it depends on the relative contributions of the two variances and the nature of the imbalance.

There should be a rule that people who ask these sorts of questions are required to make their data available, either (if the data set is small) as part of their message or (if data are extensive) on a web site. Once the results of the analysis are on display, it is often possible to make an informed guess on the likely impact. Use of simulate.lme() seems like a good idea (a sketch follows below).

John Maindonald.

On 11 Aug 2004, at 8:05 PM, [EMAIL PROTECTED] wrote:

From: Spencer Graves [EMAIL PROTECTED] Date: 10 August 2004 8:44:20 PM To: Gijs Plomp [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [R] Enduring LME confusion or Psychologists and Mixed-Effects

Have you considered trying a Monte Carlo? The significance probabilities for unbalanced anovas use approximations. Package nlme provides simulate.lme to facilitate this. I believe this function is also mentioned in Pinheiro and Bates (2000). hope this helps. spencer graves

p.s. You could try the same thing in both library(nlme) and library(lme4). Package lme4 is newer and, at least for most cases, better.

Gijs Plomp wrote:

Dear ExpeRts, Suppose I have a typical psychological experiment that is a within-subjects design with multiple crossed variables and a continuous response variable. Subjects are considered a random effect. So I could model

aov1 <- aov(resp ~ fact1*fact2 + Error(subj/(fact1*fact2)))

However, this only holds for orthogonal designs with equal numbers of observations and no missing values. These assumptions are easily violated, so I seek refuge in fitting a mixed-effects model with the nlme library.

lme1 <- lme(resp ~ fact1*fact2, random = ~1 | subj)

When testing the significance of the effects of my factors, with anova(lme1), the degrees of freedom that lme uses in the denominator span all observations and are identical for all factors and their interaction. I read in a previous post on the list ([R] Help with lme basics) that this is inherent to lme. I studied the instructive book of Pinheiro & Bates and I understand why the degrees of freedom are assigned as they are, but think it may not be appropriate in this case. Used in this way it seems that lme is more prone to type 1 errors than aov. To get more conservative degrees of freedom one could model

lme2 <- lme(resp ~ fact1*fact2, random = ~1 | subj/fact1/fact2)

But this is not a correct model, because it assumes the factors to be hierarchically ordered, which they are not. Another alternative is to model the random effect using a matrix, as seen in [R] lme and mixed effects on this list.

lme3 <- lme(resp ~ fact1*fact2, random = list(subj = pdIdent(form = ~fact1-1), subj = ~1, fact2 = ~1))

This provides correct degrees of freedom for fact1, but not for the other effects, and I must confess that I don't understand this use of matrices; I'm not a statistician. My questions thus come down to this:

1. When aov's assumptions are violated, can lme provide the right model for within-subjects designs where multiple fixed effects are NOT hierarchically ordered?
2. Are the degrees of freedom in anova(lme1) the right ones to report? If so, how do I convince a reviewer that, despite the large number of degrees of freedom, lme does provide a conservative evaluation of the effects? If not, how does one get the right denDF in a way that can be easily understood?

I hope that my confusion is all due to an ignorance of statistics and that someone on this list will kindly point that out to me. I do realize that this type of question has been asked before, but think that an illuminating answer can help R spread into the psychological community.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
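For reference, a hedged sketch of the simulate.lme() idea raised above, following the pattern of the example on its help page (check ?simulate.lme for the current interface); the Orthodont data from nlme stand in for the poster's data, purely for illustration.

library(nlme)
## Null model: random intercepts; the alternative (m2) adds random slopes.
## simulate.lme() simulates data under the null, refits both models, and
## lets you compare the empirical likelihood-ratio statistic with the
## chi-squared approximation.
orthSim <- simulate.lme(
    list(fixed = distance ~ age, data = Orthodont, random = ~ 1 | Subject),
    m2 = list(random = ~ age | Subject), nsim = 200, seed = 1)
plot(orthSim)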
Re: [R] anti-R vitriol
I am curious. What were the dimensions of this data set? Did this person know to use read.table(), or scan()? Did they know about the possibility of reading the data one part at a time?

The way that SAS processes the data row by row limits what can be done. It is often possible, with scant loss of information, and more satisfactory, to work with a subset of the large data set or with multiple subsets. Neither SAS (in my somewhat dated experience of it) nor R is entirely satisfactory for this purpose. But at least in R, given a subset that fits so easily into memory that the graphs are not masses of black, there are few logistical problems in doing, rapidly and interactively, a variety of manipulations and plots, with each new task taking advantage of the learning that has gone before. To do that well in the SAS world, it is necessary to use something like JMP or its equivalent in one of the newer modules, which process data in a way that is not all that different from R.

I have wondered about possibilities for a suite of functions that would make it easy to process through R data that are stored in one large data set, with a mix of adding a new variable or variables, repeating a calculation on successive subsets of the data, producing predictions or suchlike for separate subsets, etc.; a rough sketch follows at the end of this message. Database connections may be the way to go (cf. the Ripley and Fei Chen paper at ISI 2003), but it might also be useful to have a simple set of functions that would handle some standard requirements.

John Maindonald.

On 30 Jun 2004, at 8:02 PM, Barry Rowlingson [EMAIL PROTECTED] wrote:

A colleague is receiving some data from another person. That person reads the data in SAS and it takes 30s and uses 64k RAM. That person then tries to read the data in R and it takes 10 minutes and uses a gigabyte of RAM. Person then goes on to say: It's not that I think SAS is such great software, it's not. But I really hate badly designed software. R is designed by committee. Worse, it's designed by a committee of statisticians. They tend to confuse numerical analysis with computer science and don't have any idea about software development at all. The result is R. I do hope [your colleague] won't have to waste time doing [this analysis] in an outdated and poorly designed piece of software like R. Would any of the committee like to respond to this? Or shall we just slap our collective forehead and wonder how someone could get such a view?

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
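As a rough sketch of the "process a large data set in pieces" idea above; the file name and chunk size are invented for illustration.

## Read a large CSV in chunks, processing each chunk as it arrives
con <- file("big.csv", open = "r")                # hypothetical file
nms <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line
repeat {
    lines <- readLines(con, n = 10000)            # next block of rows
    if (length(lines) == 0) break
    tc <- textConnection(lines)
    chunk <- read.csv(tc, header = FALSE, col.names = nms)
    close(tc)
    ## ... per-chunk summaries, predictions, added variables, etc. ...
}
close(con)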
Re: [R] terminology for frames and environments
From: Peter Dalgaard [EMAIL PROTECTED] Date: 14 June 2004 5:42:29 PM

Gabor Grothendieck [EMAIL PROTECTED] writes: ... Could someone please clarify what standard terminology is?

Not sure there is one... We have been approaching consensus on a couple of occasions, but (obviously) not been too good at enforcing it. I think the consensus is that a frame is a set of variable bindings (implemented as a hashed list), and an environment is a frame plus an enclosing environment, i.e. a linked list of frames, terminated by NULL. It is occasionally necessary to refer to the individual frames as opposed to the whole list, which is exactly the point of the inherits argument. Notice that exists() talks about enclosing, which is only ever used in sense #1 above. parent is used in both senses (which is a bit unfortunate -- not quite sure whether we have decided to get rid of parent.env() eventually).

-- Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen, Blegdamsvej 3, 2200 Cph. N, Denmark. Ph: (+45) 35327918 ([EMAIL PROTECTED]) FAX: (+45) 35327907

I have found it helpful, in trying to explain (to myself and others) what happens, to say that there is both a lexical stack and a call stack. Is that a legitimate use of terminology?

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
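A small worked example of the frame/environment distinction described above (the names e1 and e2 are mine):

e1 <- new.env()                  # a frame plus an enclosing environment
assign("x", 1, envir = e1)
e2 <- new.env(parent = e1)       # e1 encloses e2
exists("x", envir = e2)                     # TRUE: searches the chain of frames
exists("x", envir = e2, inherits = FALSE)   # FALSE: this frame only
identical(parent.env(e2), e1)               # TRUE: the enclosing environment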
Re: [R] p-values
Frank - Thanks for your reply, on which I really have no comment. There are contexts where a Bayesian approach is necessary, natural and easy to handle, and can be used to broaden the inferential vision of students. Examples from HIV testing and mammography screening in Gigerenzer's book, sold in the USA under the title Calculated Risks: How to Know When Numbers Deceive You (Simon & Schuster), can be used to make highly important points. I have, in the section on inference in my book with John Braun, injected some of this into the discussion of inference.

[Incidentally, Frank's book does have a great deal of practically oriented comment (e.g., wrt variable selection) that I have found useful. There is useful critical comment on inappropriate use of p-values in such contexts.]

John Maindonald.

On 3 May 2004, at 9:58 AM, Frank E Harrell Jr wrote:

I'm sorry to have taken so long in responding to your excellent question John. And I'm responding to r-help since the question was posed there before taking the discussion offline. In the words of Don Berry, "It takes time to be a Bayesian", and that's the main reason there are no explicit Bayesian calculations in the book. I do make a lot of use of prior information though. In the future I could see making some additions to the book along the lines of inclusion of examples using the MCMCpack package, whose design is appealing to me.

R provides a lot of help for those who want a frequentist interpretation, even to including by default the *, **, *** labeling that some of us deplore. There is no similar help for those who want at least the opportunity to place the output from a modeling exercise in a Bayesian context of some description. There is surely a strong argument for the use of a more neutral form of default output, even to the excluding of p-values, on the argument that they also push too strongly in the direction of a frequentist interpretative framework.

Agreed. I do try to approximate the Bayesian approach with the bootstrap.

There seems, unfortunately, to be a dearth of good ideas on how to assist the placing of output from modeling functions such as R provides in an explicitly Bayesian framework. . . . .

It's all worth pursuing. I wish there were already a Bayesian package that made use of Bayesian methods irresistible. All the best, Frank

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
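A minimal sketch of the kind of MCMCpack usage Frank alludes to above, using a built-in data set; this is my illustration, not an example from the thread. MCMCregress() fits a linear model by sampling from the posterior and returns an mcmc object of draws.

library(MCMCpack)
fit <- MCMCregress(weight ~ height, data = women)   # posterior draws
summary(fit)   # posterior means and quantile-based credible intervals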
[R] p-values
The Bayesian framework is surely a good framework for thinking about inference, and for exploring common misinterpretations of p-values. P-values are surely unhelpful and to be avoided in cases where there is `strong' prior evidence. I will couch the discussion that follows in terms of confidence intervals, which makes the discussion simpler, rather than in terms of p-values.

The prior evidence is in my sense strong if it leads to a Bayesian credible interval that is very substantially different from the frequentist confidence interval (though I prefer the term `coverage interval'). Typically the intervals will be similar if a diffuse prior is used, i.e., all values over a wide enough range are, on some suitable scale, a priori equally likely (a numerical sketch follows at the end of this message). This is, in my view, the message that you should take from your reading.

Examples of non-diffuse priors are what Berger focuses on. Consider for example his discussion of one of Jeffreys' analyses, where Jeffreys puts 50% of the probability on a point value of a continuous parameter, i.e., there is a large spike in the prior at that point. Berger commonly has scant commentary on the specific features of his priors that make the Bayesian results seem very different (at least to the extent of having a different feel) from the frequentist results. His paper in vol. 18, no. 1 of Statistical Science (pp. 1-32; pp. 12-27 are comments from others) seems more judicious in this respect than some of his earlier papers.

It is interesting to speculate how R's model fitting routines might be tuned to allow a Bayesian interpretation. What family or families of priors would be on offer, and/or used by default? What default mechanisms would be suitable for indicating the sensitivity of results to the choice of prior?

John Maindonald.

From: Greg Tarpinian [EMAIL PROTECTED] Date: 28 April 2004 6:32:06 AM To: [EMAIL PROTECTED] Subject: [R] p-values

I apologize if this question is not completely appropriate for this list. I have been using SAS for a while and am now in the process of learning some C and R as a part of my graduate studies. All of the statistical packages I have used generally yield p-values as a default output to standard procedures. This week I have been reading Testing Precise Hypotheses by J.O. Berger & Mohan Delampady, Statistical Science, Vol. 2, No. 3, 317-355, and Bayesian Analysis: A Look at Today and Thoughts of Tomorrow by J.O. Berger, JASA, Vol. 95, No. 452, pp. 1269-1276, both as supplements to my Math Stat. course. It appears, based on these articles, that p-values are more or less useless. If this is indeed the case, then why is a p-value typically given as a default output? For example, I know that PROC MIXED and lme( ) both yield p-values for fixed effects terms. The theory I am learning does not seem to match what is commonly available in the software, and I am just wondering why. Thanks, Greg

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
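The promised numerical illustration (my example, not from the thread) of the point about diffuse priors: for a normal mean with known error SD, the credible interval under a vague normal prior all but coincides with the frequentist interval.

set.seed(42)
y <- rnorm(20, mean = 5, sd = 2); n <- length(y); sigma <- 2
## Frequentist 95% interval (sigma taken as known, for simplicity)
mean(y) + c(-1, 1) * qnorm(0.975) * sigma/sqrt(n)
## Conjugate posterior under a N(0, tau^2) prior; diffuse as tau grows
tau <- 100
post.var  <- 1/(n/sigma^2 + 1/tau^2)
post.mean <- post.var * sum(y)/sigma^2
post.mean + c(-1, 1) * qnorm(0.975) * sqrt(post.var)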
Re: [R] p-values
This is, of course, not strictly about R. But if there should be a decision to pursue such matters on this list, then we'd need another list to which such discussion might be diverted.

I've pulled Frank's Regression Modeling Strategies down from my shelf and looked to see what he says about inferential issues. There is a suggestion, in the introduction, that modeling provides the groundwork that can be used as a point of departure for a variety of inferential interpretations. As far as I can see, Bayesian interpretations are never really explicitly discussed, though the word Bayesian does appear in a couple of places in the text. Frank, do you now have ideas on how you would (perhaps, in a future edition, will) push the discussion in a more overtly Bayesian direction? What might be the style of a modeling book, aimed at practical data analysts who of necessity must (mostly, at least) use off-the-shelf software, that seriously entertains the Bayesian approach?

R provides a lot of help for those who want a frequentist interpretation, even to including by default the *, **, *** labeling that some of us deplore. There is no similar help for those who want at least the opportunity to place the output from a modeling exercise in a Bayesian context of some description. There is surely a strong argument for the use of a more neutral form of default output, even to the excluding of p-values, on the argument that they also push too strongly in the direction of a frequentist interpretative framework.

There seems, unfortunately, to be a dearth of good ideas on how to assist the placing of output from modeling functions such as R provides in an explicitly Bayesian framework. Or is it, at least in part, that I am unaware of what is out there? That, I guess, is the point of my question to Frank. Is it just too technically demanding to go much beyond trying to get users to understand that a Bayesian credible interval can, if there is an informative prior, be very different from a frequentist CI, and that they really do need to pause if there is an informative prior lurking somewhere in the undergrowth?

John Maindonald.

Frank Harrell wrote:

They [p-values] are objective only in the sense that subjectivity is deferred in a difficult-to-document way when P-values are translated into decisions. The statement that frequentist methods are the norm, which I'm afraid is usually true, is a sad comment on the state of much of scientific inquiry. IMHO P-values are so defective that the imperfect Bayesian approach should be seriously entertained.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Solutions to Exercises - Data Analysis and Graphics Using R
This message is aimed at anyone who may be using exercises from my book with John Braun for teaching purposes. I am using this channel of communication in the absence of any other obvious effective channel. I ask the forbearance of list members.

Our intention is to post solutions to selected exercises (the more challenging exercises) on the web, via a link from http://wwwmaths.anu.edu.au/~johnm/r-book.html

Anyone who would find this inconvenient should contact me directly. I will shortly post a provisional list on the above web site.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] simple test of lme, questions on DF corrections
Dear Greg - My understanding is that method=ML changes only the criterion used to fit the model; the parameter estimates are then the ML parameter estimates, and the fitted values are the fitted values for ML. The easiest way to get the anova sums of squares is to specify:

fm1.aov <- aov(travel ~ 1 + Error(factor(Rail)), data=Rail)
summary(fm1.aov)

Error: factor(Rail)
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  5 9310.5  1862.1

Error: Within
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 12 194.000  16.167

This compares with:

fm1.lme <- lme(travel ~ 1, random = ~1 | Rail, data=Rail, method="ML")
sum(fm1.lme$residuals[,2]^2)
[1] 195.0106
sqrt(195.0106/12)
[1] 4.031238

Consistent with this, the predicted values that are obtained with predict(fm1.lme, level=1) (the Best Linear Unbiased Predictors, or BLUPs) are not the group means that you can get from:

predict(aov(travel ~ Rail, data=Rail))

Thus a residual mean square of 194/12 has become, in the ML fit, 195.0106/12. This change in the predicted values (BLUPs), in the residuals, and in the sum of squares of residuals arises because the likelihood is now being maximised for a model where there are two variance parameters that must be estimated. Notice that the BLUPs are pulled back towards the overall mean, relative to the group means. NB also: specify level=1 to incorporate the random group (Rail) effects into the predicted values.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

-----

Date: Sat, 29 Mar 2003 20:19:33 -0500 From: Greg Hammett [EMAIL PROTECTED] Subject: [R] simple test of lme, questions on DF corrections To: [EMAIL PROTECTED] Message-ID: [EMAIL PROTECTED] Content-Type: text/plain; charset=US-ASCII

I'm a physicist working on fusion energy and dabble in statistics only occasionally, so please excuse gaps in my statistical knowledge. I'd appreciate any help that a real statistics expert could provide. Most people in my field do only very simple statistics, and I am trying to extend some work on multivariate linear regression to account for significant between-group correlated errors. It looks to me like the lme routine in the nlme package does most of what I want. (My congrats to the developers of R and nlme; they are extremely useful!)

To summarize my questions: I've tested lme on a very simple test case (with just 1 parameter, the mean, and 1 random effect), but the results are somewhat different than I get from some simple maximum-likelihood formulas. It appears that some quantities calculated by lme are corrected for the reduction in the degrees of freedom due to the number of fit parameters. But these corrections are slightly different (0.3%-3%) than what I would have expected, and I'd like to understand these differences better. . ---

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
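To see the shrinkage described above directly, a small sketch (the Rail data come with nlme):

library(nlme)
fm1.lme <- lme(travel ~ 1, random = ~1 | Rail, data = Rail, method = "ML")
## BLUPs incorporate the random Rail effects and are pulled towards the
## overall mean, relative to the within-group means:
blup  <- unique(predict(fm1.lme, level = 1))
gmean <- unique(predict(aov(travel ~ Rail, data = Rail)))
cbind(BLUP = blup, groupMean = gmean, overall = mean(Rail$travel))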
[R] (no subject)
Let's assume that the columns of the model matrix, apart perhaps from an initial column that corresponds to the overall mean, have been centred. Then:

1) Normal equation methods give an accurate fit to the matrix of centred sums of squares and products.
2) QR methods give an accurate fit to the predicted values.

QR will give better precision than normal equation methods (e.g., Cholesky) if there are substantial correlations between the columns of the model matrix. This is because sequential normal equation methods successively modify the centred sums of squares and products (CSSP) matrix to be a representation of the matrix of sums of squares and products of partial residuals, as columns of the model matrix are partialed out in turn. QR directly modifies a representation of the partial residuals themselves. If columns of the model matrix are almost uncorrelated, then normal equation methods may however give the better precision, essentially because the CSSP matrix does not change much and normal equation methods require fewer arithmetic operations.

In the situations where QR gives substantially better precision, the correlations between columns of the model matrix will mean that some regression coefficients have a large standard error. The variance inflation factor for some regression coefficients will be large. Will the additional precision be meaningful? The question has especial point now that double precision is standardly used.

I think it useful to expose students to both classes of method. In contexts where QR gives results that are numerically more precise, I'd encourage them to examine the variance inflation factors (they should examine them anyway). It is often a good idea, if the VIFs are large, to consider whether there is a simple re-parameterization [perhaps as simple as replacing x1 and x2 by (x1+x2) and (x1-x2)] where correlations create less havoc. This may be an important issue for interpretation, even if the increased numerical accuracy serves no useful purpose. (A small numerical sketch of the precision point follows at the end of this message.)

--

Date: Mon, 24 Feb 2003 13:50:31 -0500 From: Chong Gu [EMAIL PROTECTED]

Not only is it unfair criticism, it's probably also imprecise information. For a detailed discussion of the precision of regression estimates through QR decomposition and normal equations, one may consult Golub and Van Loan's book on Matrix Computation (1989, Section 5.3.9 on page 230). QR takes twice as much computation, requires more memory, but does NOT necessarily provide better precision. The above said, I am not questioning the adequacy of the QR approach to regression calculation as implemented in R.

That's an unfair criticism. That discussion was never intended as a recommendation for how to compute a regression. Of course, SVD or QR decompositions are the preferred method, but many newbies don't want to digest all that right from the start. These are just obscure details to the beginner. One of the strengths of R in teaching is that students can directly implement the formulae from the theory. This reinforces the connection between theory and practice. Implementing the normal equations directly is a quick early illustration of this connection. Explaining the precise details of how to fit a regression model is something that can be deferred. Julian Faraway

I am just about working through Faraway's excellent tutorial practical regression and ANOVA using R

I assume this is a reference to the PDF version available via CRAN. I am afraid that is *not* a good discussion of how to do regression, especially not using R.
That page is seriously misleading: good ways to compute regressions are QR decompositions with pivoting (which R uses) or an SVD. Solving the normal equations is well known to square the condition number, and is close to the worst possible way. (If you must use normal equations, do at least centre the columns, and preferably do some scaling.)

on page 24 he makes the x matrix:

x <- cbind(1, gala[,-c(1,2)])

how can I understand this gala[,-c(1,2)]? I couldn't find an explanation of such c-like abbreviations anywhere.

Well, it is in all good books (as they say), including `An Introduction to R'. (It's even on page 210 of that book!) -c(1,2) is (try it)

> -c(1,2)
[1] -1 -2

so this drops columns 1 and 2. It then adds in front a column made up of ones, which is usually a sign of someone not really understanding how R's linear models work.

__ [EMAIL PROTECTED] mailing list http://www.stat.math.ethz.ch/mailman/listinfo/r-help

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list http
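The promised numerical sketch of the condition-number point (my example, invented for illustration): with nearly collinear columns, forming X'X squares the condition number, so the normal equations lose roughly twice the digits that QR does.

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + 1e-6 * rnorm(n)      # almost collinear with x1
y <- x1 + x2 + rnorm(n)
X <- cbind(1, x1, x2)
b.ne <- solve(crossprod(X), crossprod(X, y))   # normal equations
b.qr <- qr.coef(qr(X), y)                      # QR, as lm.fit() uses
cbind(normal = b.ne, QR = b.qr)                # compare the coefficients
kappa(X); kappa(crossprod(X))                  # condition number is squared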
Re: [R] acceptable p-level for scientific studies
Another interesting paper is: Gigerenzer, G. 1998. We need statistical thinking, not statistical rituals. Behavioural and Brain Sciences 21: 199-200.

It is interesting that Gigerenzer's recent, highly readable and disturbing book, Calculated Risks (US: Simon & Schuster) or Reckoning with Risk (UK: Allen Lane), has no reference to p-values in the index. Gigerenzer describes an investigation where just one AIDS counsellor out of 20 showed any recognition that the estimate of the probability of infection, given a positive AIDS test, would depend on the risk group from which the person came. The very small P[positive test | no infection] is an incomplete part of the story. Regrettably, I suspect that the result would be much the same for AIDS or genetics counsellors anywhere.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list http://www.stat.math.ethz.ch/mailman/listinfo/r-help
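The point about risk groups can be made concrete with Bayes' rule; the rates below are invented for illustration and are not Gigerenzer's figures.

## P(infection | positive test) as a function of the base rate
ppv <- function(prev, sens, spec) {
    tp <- prev * sens                # true positives
    fp <- (1 - prev) * (1 - spec)    # false positives
    tp / (tp + fp)
}
ppv(prev = 0.0001, sens = 0.999, spec = 0.9999)  # low-risk group: about 0.5
ppv(prev = 0.01,   sens = 0.999, spec = 0.9999)  # high-risk group: about 0.99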