Re: [R] residual plots for lmer in lme4 package
I am doubtful whether standard residual plots are very useful in this context. One wants the theoretical effects Ui to have a normal distribution. If there are similar amounts of information on each patient, it may not be too bad to extract the estimated effects and check them for normality. I don't think you can use residuals() to extract them, as glmer() does not have the notion of levels. Maybe they can be extracted using ranef(), but I do not see any examples of its use with glmer() on the help pages. The issue of checking for normality of effects in multi-level models has not been very much researched, as far as I can tell. The function residuals() gives residuals that adjust for all except the highest level of random effects. Depending on the relative magnitudes of the variance components, whether or not these residuals are anywhere near normal may not be of much or any consequence.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 17 Aug 2007, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Martin Henry H. Stevens [EMAIL PROTECTED]
Date: 17 August 2007 12:08:15 AM
To: Margaret Gardiner-Garden [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED] [EMAIL PROTECTED]
Subject: Re: [R] residual plots for lmer in lme4 package

Hi Margaret,
Have a look at qqmath in the lattice package. ?qqmath
Hank

On Aug 16, 2007, at 2:45 AM, Margaret Gardiner-Garden wrote:

Hi, I was wondering if I might be able to ask some advice about doing residual plots for the lmer function in the lme4 package. Our group's aim is to find if the expression staining of a particular gene in a sample (or core) is related to the pathology of the core. To do this, we used the lmer function to perform the logistic mixed model below. I apologise in advance for the lack of subscripts.

logit P(yij = 1) = beta0 + Ui + beta1*Pathol(ij), where Ui ~ N(0, sigma_u^2),

i indexes patient, j indexes measurement, Pathol is an indicator variable (0,1) for benign epithelium versus cancer, and yij is the staining indicator (0,1) for each core, where yij equals 1 if the core stains positive and 0 otherwise. (I have inserted some example R code at the end of this message.) I was wondering if you knew how I could test that the errors Ui are normally distributed in my fit. I am not familiar with how to do residual plots for a mixed logistic regression (or even for any logistic regression!). Any advice would be greatly appreciated! Thanks and Regards Marg
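A minimal sketch of the check described above, with an invented data frame 'cores' having columns y (0/1 staining), pathol and patient; ranef() extracts the estimated patient effects (strictly, their conditional modes):

library(lme4)
fit <- glmer(y ~ pathol + (1 | patient), family = binomial, data = cores)
## Extract the estimated patient effects and check them for normality
u <- ranef(fit)$patient[["(Intercept)"]]
qqnorm(u); qqline(u)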
Re: [R] LSD, HSD,...
<follow-on rant> Stepwise regression variable selection methods make multiple post hoc comparisons. The number of comparisons may be very large, vastly more than the half-dozen post hoc comparisons that are common in an experimental design context. There is a disconnect here. The multiple testing issue is noted in pretty much every discussion of the analysis of experimental data, but is not commonly mentioned (at least in older texts) in discussions of stepwise regression, best subsets and related regression approaches. One reason for this silence may be that there is no ready HSD-like fix. The SEs and t-statistics that lm() gives for the finally selected model can be grossly optimistic. Running the analysis with the same model matrix, but with y-values that are noise, can give a useful wake-up call. </rant>

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 16 Jul 2007, at 8:00 PM, Simon Blomberg wrote:

If you have a priori planned comparisons, you can just test those using linear contrasts, with no need to correct for multiple testing. If you do not, and you are relying on looking at the data and analysis to tell you which treatment means to compare, and you are considering several tests, then you should consider correcting for multiple testing. There is a large literature on the properties of the various tests. (Tukey HSD usually works pretty well for me.)

<rant> Why do people design experiments with a priori hypotheses in mind, yet test them using post hoc comparison procedures? It's as if they are afraid to admit that they had hypotheses to begin with! Far better to test what you had planned to test using the more powerful methods for planned comparisons, and leave it at that. </rant>

On Mon, 2007-07-16 at 09:52 +0200, Adrian J. Montero Calvo wrote:

Hi, I'm designing an experiment in order to compare the growth of several clones of a tree species. It will be a complete randomized block design. How can I decide which method of mean comparison to choose? LSD, HSD, Tukey HSD, Duncan, ...? Thanks in advance

--
Simon Blomberg, BSc (Hons), PhD, MAppStat.
Lecturer and Consultant Statistician
Faculty of Biological and Chemical Sciences
The University of Queensland
St. Lucia Queensland 4072 Australia
Room 320 Goddard Building (8)
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
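The wake-up call is easily staged. A sketch, with an invented 100 x 40 matrix of pure-noise regressors and a pure-noise y, so that nothing is truly significant:

set.seed(17)
X <- matrix(rnorm(100 * 40), nrow = 100)
colnames(X) <- paste("x", 1:40, sep = "")
dat <- data.frame(y = rnorm(100), X)              # y is unrelated to the x's
chosen <- step(lm(y ~ ., data = dat), trace = 0)  # stepwise selection
summary(chosen)  # expect several seemingly 'significant' t-statistics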
[R] Bad points in regression [Broadcast]
None of Andy's comments are inconsistent with the point that rlm() and lqs(), if they disagree with lm(), likely offer better places to start than does lm() in identifying points that should be examined as in some sense outliers. All such methods are to be used, not as crutches, but as sources of insight. Incidentally, the rlm class inherits from lm, and plot.lm() (or, preferably, the plot generic) can be used with rlm objects. This is not the case for lqs objects.

John Maindonald.

On 17 Mar 2007, at 10:00 PM, Andy Liaw wrote:

From: Liaw, Andy [EMAIL PROTECTED]
Date: 17 March 2007 4:21:53 AM
To: Bert Gunter [EMAIL PROTECTED], [EMAIL PROTECTED], r-help@stat.math.ethz.ch
Subject: Re: [R] Bad points in regression [Broadcast]

(My turn on the soapbox ...) I'd like to add a bit of caveat to Bert's view. I'd argue (perhaps even plead) that robust/resistant procedures be used with care. They should not be used as a shortcut to avoid careful analysis of data. I recall that in my first course on regression, the professor made it clear that we're fitting models to data, not the other way around. When the model fits badly to (some of) the data, do examine and think carefully about what happened. Verify that bad data are indeed bad, instead of using statistical criteria to make that judgment. A scientific colleague reminded me of this point when I tried to sell him the idea of robust/resistant methods: "Don't use these methods as a crutch to stand on badly run experiments (or poorly fitted models)." Cheers, Andy

From: Bert Gunter

(mount soapbox ...) While I know the prior discussion represents common practice, I would argue -- perhaps even plead -- that the modern (?? 30 years old now) alternative of robust/resistant estimation be used, especially in the readily available situation of least-squares regression. RSiteSearch("Robust") will bring up numerous possibilities. rrcov and robustbase are at least two packages devoted to this, but the functionality is available in many others (e.g. rlm() in MASS).

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404
650-467-7374

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ted Harding
Sent: Friday, March 16, 2007 6:44 AM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] Bad points in regression

On 16-Mar-07 12:41:50, Alberto Monteiro wrote:

Ted Harding wrote:

alpha <- 0.3
beta <- 0.4
sigma <- 0.5
err <- rnorm(100)
err[15] <- 5; err[25] <- -4; err[50] <- 10
x <- 1:100
y <- alpha + beta * x + sigma * err
ll <- lm(y ~ x)
plot(ll)

ll is the output of a linear model fitted by lm(), and so has several components (see ?lm in the section "Value"), one of which is residuals (which can be abbreviated to res). So, in the case of your example,

which(abs(ll$res) > 2)
15 25 50

extracts the information you want (and the 2 was inspired by looking at the residuals plot from your plot(ll)).

Ok, but how can I grab those points _in general_? What is the criterion that plot used to mark those points as bad points?

Ahh ...! I see what you're after. OK, look at the plot method for lm():

?plot.lm
## S3 method for class 'lm':
plot(x, which = 1:4,
     caption = c("Residuals vs Fitted", "Normal Q-Q plot",
                 "Scale-Location plot", "Cook's distance plot"),
     panel = points, sub.caption = deparse(x$call), main = "",
     ask = prod(par("mfcol")) < length(which) && dev.interactive(),
     ..., id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75)

where (see further down):

id.n: number of points to be labelled in each plot, starting with the most extreme.
and note, in the default parameter-values listing above: id.n = 3. Hence, the 3 most extreme points (according to the criterion being plotted in each plot) are marked in each plot. So, for instance, try plot(ll, id.n=5) and you will get points 10, 15, 25, 28, 50. And so on. But that pre-supposes that you know how many points are exceptional. What is meant by "extreme" is not stated in the help page ?plot.lm, but can be identified by inspecting the code for plot.lm(), which you can see by entering

plot.lm

In your example, if you omit the line which assigns anomalous values to err[15], err[25] and err[50], then you are likely to observe that different points get identified on different plots. For instance, I just got the following results for the default id.n=3:

[1] Residuals vs Fitted:        41, 53, 59
[2] Standardised Residuals:     41, 53, 59
[3] sqrt(Stand Res) vs Fitted:  41, 53, 59
[4] Cook's Distance:            59, 96, 97

There are several approaches (with somewhat different outcomes) to identifying outliers. If you apply one of these, you will probably get the identities of the points anyway. Again in the context of your example (where in fact you deliberately set 3 points to have exceptional errors, thus
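For the record, a sketch of the comparison suggested at the top of this thread, re-using Ted's simulated data (rlm() is in MASS; the weight cut-off of 0.5 is an arbitrary illustrative choice):

library(MASS)
set.seed(42)
err <- rnorm(100)
err[15] <- 5; err[25] <- -4; err[50] <- 10
x <- 1:100
y <- 0.3 + 0.4 * x + 0.5 * err
ll.rlm <- rlm(y ~ x)
which(ll.rlm$w < 0.5)   # points down-weighted by the resistant fit
plot(ll.rlm, id.n = 5)  # rlm inherits from lm, so plot.lm applies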
[R] [R-pkgs] New version of lme4 and new mailing list R-SIG-mixed-models
Dear Douglas -

I have tried several fits that relate to ch 10 of the 2nd edition of Maindonald & Braun. In two cases a fit that does not currently converge with lmer does now converge with lmer2. In one case where I took logs in order to get convergence (the data did not really demand it), I can now get away without taking logs. Basically, none of the problems that I had to work around when I did the calculations for that chapter have surfaced. An exercise that had been problematic now converges without apparent problem. In one case CPU time was reduced by a factor of around 3. In one case where I formerly could not get convergence, the reduction (to termination of the calculations) was by a factor of around 10. The results seem similar, with a small increase in the log-likelihood (perhaps ~0.5; I have not checked very carefully) in some cases. If you want more detail, I will provide it. It looks a huge improvement. I am sure that R users will be duly grateful for your efforts.

Regards
John Maindonald.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

sessionInfo()
R version 2.4.1 Patched (2007-01-17 r40518)
i386-apple-darwin8.8.1

locale: C

attached base packages:
[1] stats graphics grDevices datasets utils methods
[7] base

other attached packages:
     DAAG      MASS      lme4    Matrix   lattice
     0.91    7.2-31 0.9975-11  0.9975-8   0.14-16
Re: [R] overdispersion
I would say rather that for binary data (binomial data with n = 1) it is not possible to detect overdispersion from examination of the Pearson chi-square or the deviance. Overdispersion may be, and often is, nevertheless present. I am arguing that overdispersion is properly regarded as a function of the variance-covariance structure, not as a function of the sample data. The variance of a two-point distribution is a known function of the mean, provided that independence and identity of distribution can be assumed, or provided that the correlation structure is otherwise known and the mean is constant. That proviso is crucial!

If there is some sort of grouping, it may be appropriate to aggregate data over the groups, yielding data that have a binomial form with n > 1. Overdispersion can now be detected from the Pearson chi-square or from the deviance. Note that the quasi models assume that the multiplier for the binomial or other variance is constant with p; that may or may not be realistic. Generalized linear mixed models make their own, different, assumptions about how the variance changes as a function of p; again these may or may not be realistic. It is then the error structure that is crucial. To the extent that it distracts from careful thinking about that structure, the term "overdispersion" is unsatisfactory.

There's no obvious way that I can see to supply glm() with an estimate of the dispersion that has been derived independently of the current analysis. Especially in the binary case, this would sometimes be useful.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 12 Jan 2007, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Peter Dalgaard [EMAIL PROTECTED]
Date: 12 January 2007 5:04:26 AM
To: evaiannario [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] overdispersion

evaiannario wrote:

How can I eliminate the overdispersion for binary data apart from the use of the quasibinomial?

There is no such thing as overdispersion for binary data. (The variance of a two-point distribution is a known function of the mean.) If what you want to do is include random effects of some sort of grouping then you might look into generalized linear mixed models via lmer() from the lme4 package or glmmPQL from MASS.
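A sketch of the aggregation idea, with invented clustered data (40 groups of 10 binary responses, with genuine group-to-group variation in p):

set.seed(1)
group <- factor(rep(1:40, each = 10))
p <- plogis(rnorm(40, 0, 1))        # group-level probabilities vary
ybin <- rbinom(400, 1, p[group])    # binary (n = 1) responses
succ <- tapply(ybin, group, sum)    # aggregate to binomial form, n = 10
fit <- glm(cbind(succ, 10 - succ) ~ 1, family = quasibinomial)
summary(fit)$dispersion             # now detectably above 1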
[R] unexpected result in glm (family=poisson) for data with an only zero response in one factor
The Wald statistics that are returned as "z value" can be a very rough approximation. The standard error is radically different, on a logarithmic scale, between log(mu) = -20.30 [the best glm() managed in approximating minus infinity] and log(mu) + log(a) = -0.29. It is actually worse than might appear; the SE = 2457.38 is an approximation to infinity! The phenomenon is an extreme version, now with a Poisson error model, of the Hauck-Donner effect (Modern Applied Statistics with S, 4th edn, pp. 197-199) that occurs with binomial data. Use the result from the anova likelihood ratio test, where the approximations that are involved are usually much better behaved (but it would not hurt to do a simulation as a check). There is an example of this phenomenon with a Poisson error model in Subsection 8.4.2 (the same subsection number both for the 1st edn and the forthcoming 2nd edn) of Data Analysis & Graphics Using R, CUP, 2003 and 2006. Install and attach the DAAG package and try example(moths).

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

Dear members, here is my trouble: My data consist of counts of trapped insects in different attractive traps. I usually use GLMs with a poisson error distribution to find out the differences between my treatments (and to look at other factor effects). But for some datasets where one treatment contains only zeros, a GLM with poisson family fails to find any difference between this particular treatment and any other one (even a treatment that has trapped a lot of insects). GLMs with gaussian family do not seem to have the same problem, but GLMs with binomial family do. I'm not sure if it is a statistical problem or if it comes from R ... in the latter case I think some solution exists (perhaps in the options of the glm() function?). Thank you for your help. Here is an example to paste into the console:

## START ##
# Take a data set of counts for two treatments, one containing only zeros
A = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
B = c(1,0,0,0,2,1,0,0,1,2,0,0,0,1,2,2,0,1,1,0,
      1,0,2,1,1,0,1,2,0,1,0,1,1,1,0,1,1,1,0,1)
traitment = c(rep("A",40), rep("B",40))
response = c(A,B)
mydata = data.frame(traitment, response)
# Make a GLM on this dataset, with family=poisson
g = glm(response ~ traitment, data=mydata, family=poisson)
anova.glm(g, test="Chisq")
# There is an effect of the traitment ...
summary(g)
# But traitment A does not differ from traitment B !!!
# (the p-value is always close to 1 in such cases)
# Now if you change only one zero of the A response to 1, the GLM works properly:
mydata[1,2] = 1
g = glm(response ~ traitment, data=mydata, family=poisson)
anova.glm(g, test="Chisq")
summary(g)
## END ##

Antonin Ferry (PhD)
Laboratoire d'Ecobiologie des Insectes Parasitoides
http://www.parasitoides.univ-rennes1.fr
Université de Rennes 1, FRANCE
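A minimal sketch of the binomial version of the Hauck-Donner effect, with invented data: as the separation between two groups grows, the Wald z statistic eventually collapses, even though the evidence is getting stronger.

x <- rep(0:1, each = 20)
for (k in c(15, 19, 20)) {
  ## group 0: 5/20 successes; group 1: k/20 successes
  y <- c(rep(1, 5), rep(0, 15), rep(1, k), rep(0, 20 - k))
  fit <- glm(y ~ x, family = binomial)
  print(summary(fit)$coefficients["x", c("z value", "Pr(>|z|)")])
}
## At k = 20 the fitted proportion is 1, the MLE is off at infinity,
## and the Wald z drops towards 0; the likelihood ratio test,
## anova(fit, test = "Chisq"), is not fooled.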
[R] unexpected result in glm (family=poisson) for data with an only zero response in one factor
PS. In part, the problem is with the use of the log link, arising because the limit of log(mu), as mu goes to 0, is minus infinity. This is not an appropriate scale on which to represent a fitted value that is zero. The estimated SE for a fitted value of zero should be 0. You will get a more sensible answer if you set family=poisson(link=sqrt):

g <- glm(response ~ traitment, data=mydata, family=poisson(link=sqrt))
summary(g)

Call:
glm(formula = response ~ traitment, family = poisson(link = sqrt),
    data = mydata)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.225e+00  -2.730e-05  -2.730e-05   2.745e-01   1.193e+00

Coefficients:
             Estimate Std. Error   z value Pr(>|z|)
(Intercept) 1.930e-05  0.0790569 0.0002441   0.9998
traitmentB  0.8660061  0.1118034     7.746  9.5e-15

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 75.485  on 79  degrees of freedom
Residual deviance: 33.896  on 78  degrees of freedom
AIC: 89.579

Number of Fisher Scoring iterations: 14

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
Re: [R] Conservative ANOVA tables in lmer
A Wiki entry is an excellent idea. I am happy to try to help. An account of mcmcsamp() might be a very useful part of the Wiki. My limited investigations suggest that once the data start to overwhelm the prior (maybe ~3 df for an effect that is of interest), the posterior distribution that it gives provides a very good approximation to the sampling distribution. I have been meaning to put aside time to try to work out, with the help of a colleague here at ANU, how the Kenward & Roger (Biometrics, 1997) approximation might be implemented in lmer, but it hasn't yet happened and is unlikely to do so for a while.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 10 Sep 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Spencer Graves [EMAIL PROTECTED]
Date: 10 September 2006 4:54:50 PM
To: Andrew Robinson [EMAIL PROTECTED]
Cc: Douglas Bates [EMAIL PROTECTED], R-Help [EMAIL PROTECTED]
Subject: Re: [R] Conservative ANOVA tables in lmer

Hi, Doug, et al.: I'll volunteer to do the same, which is an extension of much of what I've been doing for R Help for a while now. Regarding writing a FAQ, what about a Wiki entry (and maybe ultimately a vignette)? This thread provides notes around which such could be built. Another piece might be an example from Scheffé (1958), which I sent as a reply to an earlier comment on this thread (foolishly sent without reducing the cc list, which means it awaits moderator approval). Each time a question of this nature arises, someone checks the Wiki, edits/adds something to it if necessary, then replies to the list with the reference to the appropriate Wiki entry. Spencer Graves

Andrew Robinson wrote:

On Thu, Sep 07, 2006 at 07:59:58AM -0500, Douglas Bates wrote:

I would be happy to re-institute p-values for fixed effects in the summary and anova methods for lmer objects, using a denominator degrees of freedom based on the trace of the hat matrix or the rank of Z:X, if others will volunteer to respond to the "these answers are obviously wrong because they don't agree with <whatever>" and "the idiot who wrote this software should be thrashed to within an inch of his life" messages. I don't have the patience.

This seems to be more than fair to me. I'll volunteer to help explain why the anova.lmer() output doesn't match SAS, etc. Is it worth putting a caveat in the output and the help files? Is it even worth writing a FAQ about this? Cheers Andrew
[R] Robustness of linear mixed models
I'd use mcmcsamp() to examine the posterior distribution, under a relatively uninformative prior, of the parameter estimates. For estimates that are based on four or five or more degrees of freedom, I'd surmise that the prior will not matter too much. With estimates where the number of degrees of freedom is one or two or three, the posterior distribution may vary greatly from one run of mcmcsamp() to another. Of course, the definition of degrees of freedom becomes quite fraught if there is severe imbalance; e.g., if some of the items that contribute to the estimate are based on much more data than others.

Subject to such caveats as just noted, I'd expect that credible intervals derived from the posterior distributions would be close to the usual frequentist confidence intervals. The main effect of the non-normality may be that the estimates are inefficient, i.e., the variance may be larger, or the distributions more dispersed, than for true maximum likelihood estimates, were you able to obtain them!

John Maindonald

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 27 Jun 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Bruno L. Giordano [EMAIL PROTECTED]
Date: 27 June 2006 11:21:25 AM
To: r-help@stat.math.ethz.ch
Subject: [R] Robustness of linear mixed models

Hello, with 4 different linear mixed models (continuous dependent) I find that my residuals do not follow the normality assumption (significant Shapiro-Wilk with values equal to or higher than 0.976; sample sizes 750 or 1200). I find, instead, that my residuals are really well fitted by a t distribution with df ranging, in the different datasets, from 5 to 12. Should this be considered such a severe violation of the normality assumption as to make model-based inferences invalid? Thanks a lot, Bruno
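A sketch of a graphical check along Bruno's lines (names invented; 'fit' stands for the fitted mixed model, and fitdistr() from MASS estimates the t parameters):

library(MASS)
res <- resid(fit)
tpars <- fitdistr(res, "t")$estimate    # location m, scale s, df
r <- (res - tpars["m"]) / tpars["s"]    # centre and rescale
qqplot(qt(ppoints(length(r)), df = tpars["df"]), r,
       xlab = "t quantiles", ylab = "scaled residuals")
abline(0, 1)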
Re: [R] regression modeling
An interesting example of this is the forest cover dataset that is available from http://www.ics.uci.edu/~mlearn. The proportions of the different cover types change systematically as one moves through the file. It seems that distance through the file is a proxy for the geographical co-ordinates. Fitting a tree-based or suchlike model to the total data is not the way to go, unless one is going to model the pattern of change through the file as part of the modeling exercise. In any case, some preliminary exploration of the data is called for, so that such matters come to attention. For my money, the issue is not ease of performing regression with huge data sets, but ease of data exploration.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 26 Apr 2006, at 8:00 PM, [EMAIL PROTECTED] wrote:

From: Berton Gunter [EMAIL PROTECTED]
Date: 26 April 2006 6:47:12 AM
To: 'Weiwei Shi' [EMAIL PROTECTED], 'bogdan romocea' [EMAIL PROTECTED]
Cc: 'r-help' R-help@stat.math.ethz.ch
Subject: Re: [R] regression modeling

May I offer a perhaps contrary perspective on this. Statistical **theory** tells us that the precision of estimates improves as sample size increases. However, in practice, this is not always the case. The reason is that it can take time to collect that extra data, and things change over time. So the very definition of what one is measuring, the measurement technology by which it is measured (think about estimating tumor size or disease incidence or underemployment, for example), the presence or absence of known or unknown large systematic effects, and so forth may change in unknown ways. This defeats, or at least complicates, the fundamental assumption that one is sampling from a (fixed) population or stable (e.g. homogeneous, stationary) process, so it's no wonder that all statistical bets are off. Of course, sometimes the necessary information to account for these issues is present, and appropriate (but often complex) statistical analyses can be performed. But not always.

Thus, I am suspicious, cynical even, about those who advocate collecting all the data and subjecting the whole vast heterogeneous mess to arcane and ever more computer-intensive (and adjustable-parameter-ridden) data mining algorithms to detect trends or "discover knowledge". To me, it sounds like a prescription for turning on all the equipment and waiting to see what happens in the science lab instead of performing careful, well-designed experiments. I realize, of course, that there are many perfectly legitimate areas of scientific research, from geophysics to evolutionary biology to sociology, where one cannot (easily) perform planned experiments. But my point is that good science demands that in all circumstances, and especially when one accumulates and attempts to aggregate data taken over spans of time and space, one needs to beware of oversimplification, including statistical oversimplification. So interrogate the measurement, be skeptical of stability, expect inconsistency. While "all models are wrong but some are useful" (George Box), the second law tells us that entropy still rules. (Needless to say, public or private contrary views are welcome.)

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P.
Box
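A sketch of the kind of exploration meant above, assuming the covtype data have been downloaded from the UCI site as a comma-delimited file with the cover type in the final column (the file has no header, so column positions are assumed):

covtype <- read.csv("covtype.data", header = FALSE)
cover <- factor(covtype[[ncol(covtype)]])
slab <- cut(seq_along(cover), breaks = 20)  # 20 successive slabs of the file
round(prop.table(table(slab, cover), margin = 1), 2)
# Proportions that trend across slabs show that position in the file matters.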
Re: [R] Optimal platform for R
I've found memory management sometimes problematic under Windows. I have a calculation that runs without difficulty in 512MB under Mac OS X (10.3 or 10.4); I'd expect the same under Linux. Under Windows (XP Professional) with 512MB, it requires a freshly booted system. But maybe the new machines will have so much memory that memory management will not be an issue.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 10 Mar 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Prof Brian Ripley [EMAIL PROTECTED]
Date: 10 March 2006 6:50:03 PM
To: [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Optimal platform for R

On Thu, 9 Mar 2006, [EMAIL PROTECTED] wrote:

I'm looking to buy a new desktop which will primarily be used for analyses of large datasets (100s of MB). I've seen postings from several years back re the 'optimal' platform for running R, but nothing more recently.

It is a subject which comes up every few months. Many of the developers are running dual (or dual-core) Opterons/Athlon 64s under Linux these days.

Specifically, I want to know: 1) if I run R under Windows, does having a dual-processor machine help speed things up? And 2) is it still true that R performs about as well under Windows as Linux?

Duncan Murdoch has already mentioned the 64-bit advantage if you need large datasets, but there is also a speed penalty if you do not. Your description seems on the margins (depends how many 100s and what the format is and what you want to do). One advantage of AMD64 Linux is that I can run either 32- or 64-bit versions of R and choose to have speed or space for any given task. A dual processor will be of little help in running R faster. R's interpreter is single-threaded, and although you can get some advantage in using multi-threaded BLAS libraries in large matrix computations, these are not readily available for R under Windows, and the advantage is often small under Linux. Running two or more instances of R will take advantage of dual processors, and I have been running dual-CPU machines for a decade. As for Windows vs Linux, R runs on the same hardware at about the same speed when comparing the standard Windows build with a shared library version on Linux (standard for e.g. the RH RPMs), but the standard Linux build is 10-20% faster. For one set of comparisons see http://sekhon.berkeley.edu/macosx/

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
Re: [R] repeated measures ANOVA
There seem to be several issues here:

1) In the analysis that has a (1|Subject) error term, there is a large negative correlation between the parameter estimates for time and time:group. Overall, the effect of time is significant, as can be seen from

time.lme <- lme(p.pa ~ time * group, random = ~ 1 | subject, method="ML")
notime.lme <- lme(p.pa ~ group, random = ~ 1 | subject, method="ML")
anova(time.lme, notime.lme)

           Model df   AIC   BIC logLik   Test L.Ratio p-value
time.lme       1  6 245.0 253.4 -116.5
notime.lme     2  4 254.0 259.6 -123.0 1 vs 2   12.95  0.0015

What is uncertain is how this time effect should be divided up, between a main effect of slope and the interaction.

2) What the interaction plot makes clear, and what the change in treatment (for group 1 only?) at time point 3 should have suggested, is that the above analysis is not really appropriate. There are two comparisons: (i) at time points 1 and 2; and (ii) at time point 3.

3) The above does not allow for a random group-to-group change in slope, additional to the change that can be expected from random variation about the line. Models 3 and 4 in your account do this, and allow also for group:subject and group:time random effects that make matters more complicated still. The fitting of such a model has the consequence that between-group differences in slope are entirely explained by this random effect. Contrary to what the lmer() output might suggest, no degrees of freedom are left with which to estimate the time:group interaction. (Or you can estimate the interaction, and no degrees of freedom are left for either the time or time:group random effect.) All you can talk about is the average and the difference of the time effects for these two specific groups.

Thus, following on from (3), I do not understand how lmer() is able to calculate a t-statistic. There seems to me to be double dipping. Certainly, I noted a convergence problem.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 28 Feb 2006, at 10:00 PM, Christian Gold wrote:

From: Christian Gold [EMAIL PROTECTED]
Date: 28 February 2006 2:15:04 AM
To: r-help@stat.math.ethz.ch
Subject: [R] repeated measures ANOVA

Dear list members: I have the following data:

group <- rep(rep(1:2, c(5,5)), 3)
time <- rep(1:3, rep(10,3))
subject <- rep(1:10, 3)
p.pa <- c(92, 44, 49, 52, 41, 34, 32, 65, 47, 58, 94, 82, 48, 60, 47,
          46, 41, 73, 60, 69, 95, 53, 44, 66, 62, 46, 53, 73, 84, 79)
P.PA <- data.frame(subject, group, time, p.pa)

The ten subjects were randomly assigned to one of two groups and measured three times. (The treatment changes after the second time point.) Now I am trying to find out the most adequate way for an analysis of main effects and interaction. Most social scientists would call this analysis a repeated measures ANOVA, but I understand that mixed-effects model is a more generic term for the same analysis. I did the analysis in four ways (one in SPSS, three in R):

1. In SPSS I used general linear model, repeated measures, defining a within-subject factor for the three different time points. (The data frame is structured differently in SPSS so that there is one line for each subject, and each time point is a separate variable.) Time was significant.

2.
Analogous to what is recommended in the first chapter of Pinheiro & Bates' Mixed-Effects Models book, I used

library(nlme)
summary(lme(p.pa ~ time * group, random = ~ 1 | subject))

Here, time was NOT significant. This was surprising not only in comparison with the result in SPSS, but also when looking at the graph:

interaction.plot(time, group, p.pa)

3. I then tried a code for the lme4 package, as described by Douglas Bates in R News 5(1), 2005 (pp. 27-30). The result was the same as in 2.

library(lme4)
summary(lmer(p.pa ~ time * group + (time * group | subject), P.PA))

4. Then I also tried what Jonathan Baron suggests in his "Notes on the use of R for psychology experiments and questionnaires" (on CRAN):

summary(aov(p.pa ~ time * group + Error(subject/(time * group))))

This gives me yet another result. So I am confused. Which one should I use? Thanks Christian
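A sketch (not part of the original thread) of one way to set up the two comparisons identified in point (2) of the reply above, treating times 1-2 and time 3 as separate phases:

library(nlme)
P.PA$phase <- factor(ifelse(P.PA$time <= 2, "t1-2", "t3"))
P.PA$group <- factor(P.PA$group)
phase.lme <- lme(p.pa ~ phase * group, random = ~ 1 | subject,
                 data = P.PA, method = "ML")
summary(phase.lme)   # group contrasts within each phase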
Re: [R] repeated measures ANOVA
There was a mistake in my earlier note that I should correct:

(Or you can estimate the interaction, and no degrees of freedom are left for either the time or time:group random effect.) All you can talk about is the average and the difference of the time effects for these two specific groups.

There is no problem with the time random effect; that can be estimated from the within-group variation in slopes, between subjects.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
[R] In which application areas is R used?
If anyone has a list of application areas where there is extensive use of R, I'd like to hear of it. My current short list is:

  Bioinformatics
  Epidemiology
  Geophysics
  Agriculture and crop science

John Maindonald
Mathematical Sciences Institute, Australian National University.
[EMAIL PROTECTED]
Re: [R] In which application areas is R used?
In this context "extensive" might be use of R in at least maybe 2% or 5% of the published analyses in the area, enough to make waves and stir awareness. The immediate subtext is the demand of a book publisher for a list of journals to which a new edition of a certain book might be sent for review, and for a list of conferences where it might be given exposure. For myself, in the medium to longer term, I am more interested in other subtexts such as you mention, to which the answer might have relevance.

I've wondered what support there'd be for starting a database of bibliographic information on papers where R was used for the analysis. Authors might supply the information, or readers of a paper suggest its addition to the database. Once well populated, this would provide a useful indication of the range of application areas and journals where R is finding use. [Or has someone, somewhere, already started such a database?]

Finance and biostatistics are obvious areas that I'd omitted. Other areas drawn to my attention have been telephony and electronic networks, solid state etc. manufacturing, computer system performance, oceanography and fisheries research, risk analysis, process engineering and marketing. (I hope my summaries are acceptably accurate.) I'm not sure what force these other respondents have given the word "extensive".

John Maindonald
Mathematical Sciences Institute
Australian National University.
[EMAIL PROTECTED]

Berton Gunter wrote:

Define "extensive". I think your answers depend on your definition. I know a bunch of folks in pharmaceutical preclinical R&D who use R for all sorts of stuff (analysis and visualization of tox and efficacy animal studies, dose/response modeling, PK work, IC50 determination, stability data analysis, etc.). Is "bunch" a majority? I strongly doubt that it's near. Is it 5%, 10%, 30%?? Dunno. Excel is still the Big Boy in most of these arenas I would bet. But I would also bet that there are at least 1 or 2 folks in dozens of companies who use R for these things. Is there a subtext to your query? -- i.e. are you trying to make an argument for something? -- Bert

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Re: [R] Splitting the list
I've changed the heading because this really is another thread.

I think it inevitable that there will, in the course of time, be other lists that are devoted, in some shape or form, to the concerns of practitioners (at all levels) who are using R. One development I'd not like to see is fracture along application area lines, allowing those who are comfortable in coteries whose focus was somewhat relevant to standards of use of statistics in that area 15 or 20 years ago to continue that way. One of the great things about R, in its development to date, has been its role in exposing people from a variety of application area communities to statistical traditions different from those in which they have been nurtured. I expect it to have a continuing role in raising statistical analysis standards, in raising the bar.

Another possibility is fracture along geographic boundaries. This has both benefits (one being that it is easier, within a smaller circle of people who are more likely to know each other, for contributors to establish a rapport that will make the list really effective; also there will be notices and discussion that are of local interest) and drawbacks (it risks separating subscribers off from important discussions on the official R lists). On balance, this may be the better way to go. Indeed, subscribers to ANZSTAT (the Australian and NZ statistical list) will know that an R-downunder list, hosted at Auckland, is currently in test-drive mode. There should be enough subscribers in common between this and the official R lists that the south-eastern portion of Gondwana does not, at any time in the very near future, float off totally on its own.

There are of course other possibilities, and it may be useful to canvass them.

John Maindonald            email: [EMAIL PROTECTED]
phone: +61 2 (6125)3473    fax: +61 2 (6125)5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 4 Jan 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Ben Fairbank [EMAIL PROTECTED]
Date: 4 January 2006 4:42:31 AM
To: R-help@stat.math.ethz.ch
Subject: Re: [R] A comment about R:

One implicit point in Kjetil's message is the difficulty of learning enough of R to make its use a natural and desired first-choice alternative, which I see as the point at which real progress and learning commence with any new language. I agree that the long learning curve is a serious problem, and in the past I have discussed, off list, with one of the very senior contributors to this list the possibility of splitting the list into sections for newcomers and for advanced users. He gave some very cogent reasons for not splitting, such as the possibility of newcomers' getting bad advice from others only slightly more advanced than themselves. And yet I suspect that a newcomers' section would encourage the kind of mutually helpful collegiality among newcomers that now characterizes the exchanges of the more experienced users on this list. I know that I have occasionally been reluctant to post issues that seem too elementary or trivial to vex the others on the list with, and so have stumbled around for an hour or so seeking the solution to a simple problem. Had I the counsel of others similarly situated, progress might have been far faster. Have other newcomers or occasional users had the same experience? Is it time to reconsider splitting this list into two sections? Certainly the volume of traffic could justify it.
Ben Fairbank
Re: [R] lme X lmer results
From a quick look at the paper in the SAS proceedings, the simulations seem limited to nested designs. The major problems are with repeated measures designs where the error structure is not compound symmetric, which lme4 does not at present handle (unless I have missed something). Such imbalance as was investigated was not a serious issue, at least for the Kenward and Roger degree of freedom calculations. The paper ends by commenting that research should continue.

What may be even more important is to educate users to think carefully about any df that they are presented with, and to be especially sceptical when designs are not approximately balanced nested designs and/or there are repeated measures error structures that are not compound symmetric. It is also necessary to consider how well the analysis reflects matters on which there may be existing good evidence. Suppose in Ronaldo's case that he'd previously run a number of experiments with very similar plots and observational units, and with comparable treatments and outcome measures. If the Plot1 SD estimate (i.e., at the level of experimental units) had never been larger than 0.01, with the SD for observational units always in a range of 2 to 20, I'd take this as licence to ignore the variance at the Plot1 level. It would be nice to be able to build in such prior information more formally, probably via a modified version of mcmcsamp(). [Some people are never satisfied: you've written a great piece of software, and users reward you by complaining that they want even more!]

John Maindonald.

On 1 Jan 2006, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Dave Atkins [EMAIL PROTECTED]
Date: 1 January 2006 1:40:45 AM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

Message: 18
Date: Fri, 30 Dec 2005 12:51:59 -0600
From: Douglas Bates [EMAIL PROTECTED]
Subject: Re: [R] lme X lmer results
To: John Maindonald [EMAIL PROTECTED]
Cc: r-help@stat.math.ethz.ch

On 12/29/05, John Maindonald [EMAIL PROTECTED] wrote:

Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

Doug Bates commented: I don't want to be obtuse and argumentative but I still am not convinced that there is a correct denominator degrees of freedom for _this_ F statistic. I may be wrong about this but I think you are referring to an F statistic based on a denominator from a different error stratum, which is not what is being quoted. (Those are not given because they don't generalize to unbalanced designs.) This is why I would like to see someone undertake a simulation study to compare various approaches to inference for the fixed effects terms in a mixed model, using realistic (i.e. unbalanced) examples.
Doug --

Here is a paper that focused on the various alternatives to denominator degrees of freedom in SAS and does report some simulation results:

http://www2.sas.com/proceedings/sugi26/p262-26.pdf

Not sure whether it argues convincingly one way or the other in the present discussion. cheers, Dave

--
Dave Atkins, PhD
[EMAIL PROTECTED]
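For a balanced nested design of the kind discussed in this thread, one way to "work out df for yourself" is to fit with aov() and an Error() term, which displays the error strata and their df explicitly. A sketch with invented data (2 treatments, 4 plots per treatment, 10 observations per plot):

set.seed(7)
dat <- expand.grid(rep = 1:10, plot = 1:4, trt = c("a", "b"))
dat$plot <- factor(paste(dat$trt, dat$plot))   # plots nested within trt
dat$y <- rnorm(nrow(dat)) + rep(rnorm(8), each = 10)
summary(aov(y ~ trt + Error(plot), data = dat))
# trt is tested in the 'plot' stratum: 1 and 6 df, not 1 and 78.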
Re: [R] lme X lmer results
contribute equally to sd1^2, and sd1^2 usually tracks quite closely to a chi-squared distribution.

John Maindonald.

On 31 Dec 2005, at 5:51 AM, Douglas Bates wrote:

On 12/29/05, John Maindonald [EMAIL PROTECTED] wrote:

Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

I don't want to be obtuse and argumentative but I still am not convinced that there is a correct denominator degrees of freedom for _this_ F statistic. I may be wrong about this but I think you are referring to an F statistic based on a denominator from a different error stratum, which is not what is being quoted. (Those are not given because they don't generalize to unbalanced designs.) This is why I would like to see someone undertake a simulation study to compare various approaches to inference for the fixed effects terms in a mixed model, using realistic (i.e. unbalanced) examples.

It seems peculiar to me that the F statistics are being created from the ratios of mean squares for different terms to the _same_ mean square (actually a penalized sum of squares divided by the degrees of freedom) and the adjustment suggested to take into account the presence of the random effects is to change the denominator degrees of freedom. I think the rationale for this is an attempt to generalize another approach (the use of error strata) even though it is not being used here.

Rather than giving df that for the comparison(s) of interest may be highly inflated, I'd prefer to give no degrees of freedom at all, to encourage users to work out df for themselves if at all possible. If they are not able to do this, then mcmcsamp() is a good alternative, and may be the way to go in any case. This has the further advantage of allowing assessments in cases where the relevant distribution is hard to get at. I'd think a warning in order that the df are upper bounds, and may be grossly inflated.

As I said, I am willing to change this if it is shown to be grossly inaccurate but please show me.

Incidentally, does mcmcsamp() do its calculations pretty well independently of the lmer results?

mcmcsamp starts from the parameter estimates when creating the chain but that is the extent to which it depends on the lmer results.

John Maindonald.

On 29 Dec 2005, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Douglas Bates [EMAIL PROTECTED]
Date: 29 December 2005 5:59:07 AM
To: Ronaldo Reis-Jr. [EMAIL PROTECTED]
Cc: R-Help r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

On 12/26/05, Ronaldo Reis-Jr. [EMAIL PROTECTED] wrote:

Hi, this is not a new doubt, but is a doubt that I can't find a good response to.
Look at this output:

m.lme <- lme(Yvar ~ Xvar, random = ~1 | Plot1/Plot2/Plot3)
anova(m.lme)

            numDF denDF  F-value p-value
(Intercept)     1   860 210.2457  <.0001
Xvar            1     2   1.2352  0.3821

summary(m.lme)

Linear mixed-effects model fit by REML
Data: NULL
     AIC      BIC    logLik
 5416.59 5445.256 -2702.295

Random effects:
 Formula: ~1 | Plot1
        (Intercept)
StdDev: 0.000745924

 Formula: ~1 | Plot2 %in% Plot1
        (Intercept)
StdDev: 0.000158718

 Formula: ~1 | Plot3 %in% Plot2 %in% Plot1
        (Intercept) Residual
StdDev: 0.000196583 5.216954

Fixed effects: Yvar ~ Xvar
                Value Std.Error  DF  t-value p-value
(Intercept) 2.3545454 0.2487091 860 9.467066  0.0000
XvarFactor2 0.3909091 0.3517278   2 1.111397  0.3821

Number of Observations: 880
Number of Groups:
                      Plot1            Plot2 %in% Plot1
                          4                           8
Plot3 %in% Plot2 %in% Plot1
                         20

This is the correct result, the correct denDF for Xvar. I make this using lmer:

m.lmer <- lmer(Yvar ~ Xvar + (1|Plot1) + (1|Plot1:Plot2) + (1|Plot3))
anova(m.lmer)

Analysis of Variance Table
     Df Sum Sq Mean Sq  Denom F value Pr(>F)
Xvar  1  33.62   33.62 878.00  1.2352 0.2667

summary(m.lmer)

Linear mixed-effects model fit by REML
Formula: Yvar ~ Xvar + (1 | Plot1) + (1 | Plot1:Plot2) + (1 | Plot3)
     AIC     BIC    logLik MLdeviance REMLdeviance
 5416.59 5445.27 -2702.295   5402.698      5404.59

Random effects:
 Groups      Name        Variance   Std.Dev.
 Plot3       (Intercept) 1.3608e-08 0.00011665
 Plot1:Plot2 (Intercept) 1.3608e-08 0.00011665
 Plot1       (Intercept) 1.3608e-08 0.00011665
[R] lme X lmer results
Surely there is a correct denominator degrees of freedom if the design is balanced, as Ronaldo's design seems to be. Assuming that he has specified the design correctly to lme() and that lme() is getting the df right, the difference is between 2 df and 878 df. If the t-statistic for the second level of Xvar had been 3.0 rather than 1.1, the difference would be between a p-value equal to 0.095 and 1e-6. In a design where there are 10 observations on each experimental unit, and all comparisons are at the level of experimental units or above, df for all comparisons will be inflated by a factor of at least 9.

Rather than giving df that for the comparison(s) of interest may be highly inflated, I'd prefer to give no degrees of freedom at all, to encourage users to work out df for themselves if at all possible. If they are not able to do this, then mcmcsamp() is a good alternative, and may be the way to go in any case. This has the further advantage of allowing assessments in cases where the relevant distribution is hard to get at. I'd think a warning in order that the df are upper bounds, and may be grossly inflated.

Incidentally, does mcmcsamp() do its calculations pretty well independently of the lmer results?

John Maindonald.

On 29 Dec 2005, at 10:00 PM, [EMAIL PROTECTED] wrote:

From: Douglas Bates [EMAIL PROTECTED]
Date: 29 December 2005 5:59:07 AM
To: Ronaldo Reis-Jr. [EMAIL PROTECTED]
Cc: R-Help r-help@stat.math.ethz.ch
Subject: Re: [R] lme X lmer results

On 12/26/05, Ronaldo Reis-Jr. [EMAIL PROTECTED] wrote:

Hi, this is not a new doubt, but is a doubt that I can't find a good response to. Look at this output:

m.lme <- lme(Yvar ~ Xvar, random = ~1 | Plot1/Plot2/Plot3)
anova(m.lme)

            numDF denDF  F-value p-value
(Intercept)     1   860 210.2457  <.0001
Xvar            1     2   1.2352  0.3821

summary(m.lme)

Linear mixed-effects model fit by REML
Data: NULL
     AIC      BIC    logLik
 5416.59 5445.256 -2702.295

Random effects:
 Formula: ~1 | Plot1
        (Intercept)
StdDev: 0.000745924

 Formula: ~1 | Plot2 %in% Plot1
        (Intercept)
StdDev: 0.000158718

 Formula: ~1 | Plot3 %in% Plot2 %in% Plot1
        (Intercept) Residual
StdDev: 0.000196583 5.216954

Fixed effects: Yvar ~ Xvar
                Value Std.Error  DF  t-value p-value
(Intercept) 2.3545454 0.2487091 860 9.467066  0.0000
XvarFactor2 0.3909091 0.3517278   2 1.111397  0.3821

Number of Observations: 880
Number of Groups:
                      Plot1            Plot2 %in% Plot1
                          4                           8
Plot3 %in% Plot2 %in% Plot1
                         20

This is the correct result, the correct denDF for Xvar. I make this using lmer:

m.lmer <- lmer(Yvar ~ Xvar + (1|Plot1) + (1|Plot1:Plot2) + (1|Plot3))
anova(m.lmer)

Analysis of Variance Table
     Df Sum Sq Mean Sq  Denom F value Pr(>F)
Xvar  1  33.62   33.62 878.00  1.2352 0.2667

summary(m.lmer)

Linear mixed-effects model fit by REML
Formula: Yvar ~ Xvar + (1 | Plot1) + (1 | Plot1:Plot2) + (1 | Plot3)
     AIC     BIC    logLik MLdeviance REMLdeviance
 5416.59 5445.27 -2702.295   5402.698      5404.59

Random effects:
 Groups      Name        Variance   Std.Dev.
 Plot3       (Intercept) 1.3608e-08 0.00011665
 Plot1:Plot2 (Intercept) 1.3608e-08 0.00011665
 Plot1       (Intercept) 1.3608e-08 0.00011665
 Residual                2.7217e+01 5.21695390
# of obs: 880, groups: Plot3, 20; Plot1:Plot2, 8; Plot1, 4

Fixed effects:
            Estimate Std. Error  DF t value Pr(>|t|)
(Intercept)  2.35455    0.24871 878  9.4671   <2e-16 ***
XvarFactor2  0.39091    0.35173 878  1.1114   0.2667
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Look at the wrong P value; I know that it is wrong because of the DF used. But in this case the result is not correct. There is no difference between the result using random effects with lmer and using a simple analysis with lm.
You are assuming that there is a correct value of the denominator degrees of freedom. I don't believe there is. The statistic that is quoted there doesn't have exactly an F distribution, so there is no correct number of degrees of freedom. One thing you can do with lmer is to form a Markov Chain Monte Carlo sample from the posterior distribution of the parameters, so you can check whether the value of zero is in the middle of the distribution of XvarFactor2 or not. It would be possible for me to recreate in lmer the rules used in lme for calculating denominator degrees of freedom associated with terms in the random effects. However, the class of models fit by lmer is larger than the class of models fit by lme (at least as far as the structure of the random-effects terms goes). In particular, lmer allows for random effects associated with crossed or partially crossed grouping factors, and the rules for denominator degrees of freedom in lme only
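For a feel for how much the choice of denominator df matters here, the two-sided p-values for the hypothetical t-statistic of 3.0 mentioned above can be checked directly (base R only; a quick illustration, not part of the original thread):

2 * pt(3.0, df = 2, lower.tail = FALSE)    # ~0.095, on 2 df
2 * pt(3.0, df = 878, lower.tail = FALSE)  # ~0.003, on 878 df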
[R] arima: warning when fixing MA parameters.
I am puzzled by the warning message in the output below. It appears whether or not I fit the seasonal term (but the precise point of doing this was to fit what is effectively a second seasonal term). Is there some deep reason why AR parameters (Warning message: some AR parameters were fixed: ...) should somehow intrude into the fitting of a model that has only MA terms?

library(DAAG)
attach(bomsoi)
# The following is fine:
arima(avrain, order=c(0,0,4), seasonal=list(order=c(0,0,1), period=12),
      fixed=c(NA,0,0,NA,NA,NA))
. . .
# The following generates a warning message:
arima(avrain, order=c(0,0,4), seasonal=list(order=c(0,0,1), period=12),
      fixed=c(0,0,0,NA,NA,NA))

Call:
arima(x = avrain, order = c(0, 0, 4), seasonal = list(order = c(0, 0, 1),
    period = 12), fixed = c(0, 0, 0, NA, NA, NA))

Coefficients:
      ma1  ma2  ma3     ma4     sma1  intercept
        0    0    0  0.0357  -0.1061   456.6675
s.e.    0    0    0  0.1015   0.0886     7.6997

sigma^2 estimated as 6849:  log likelihood = -595.23,  aic = 1198.46

Warning message:
some AR parameters were fixed: setting transform.pars = FALSE in:
arima(avrain, order = c(0, 0, 4), seasonal = list(order = c(0,

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
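If the warning is, as it appears, spurious for a pure MA model, one possible workaround (a sketch only; whether the warning can safely be ignored is exactly the question at issue) is to set transform.pars explicitly, so that arima() does not need to change it itself and so does not warn. This continues from the attach(bomsoi) above:

arima(avrain, order = c(0, 0, 4),
      seasonal = list(order = c(0, 0, 1), period = 12),
      fixed = c(0, 0, 0, NA, NA, NA), transform.pars = FALSE)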
Re: [R] Multiple expressions, when using substitute()
Yes, I did get a very helpful reply from Marc Schwartz. I have had substitute() working in legend(), when the legend argument has length one. The challenge was to find some way to do the equivalent of substitute() when several expressions appear in parallel, as may be required for legend(). The trick is to use bquote() to do the substitution. The resulting quoted expression (of mode call) can then be an element in a list, along with other quote()d (or bquote()d) expressions. The list elements, when passed to expression() via the args argument of do.call(), become unquoted expressions. Note that bquote() uses a syntax for the substitution of variables that is different from that used by substitute(). It would be useful to include some such example as the one below on the help page for bquote():

library(DAAG)
Acmena <- subset(rainforest, species == "Acmena smithii")
plot(wood ~ dbh, data = Acmena)
Acmena.lm <- lm(log(wood) ~ log(dbh), data = Acmena)
b <- round(coef(Acmena.lm), 3)
arg1 <- bquote(italic(y) == .(A) * italic(x)^.(B), list(A = b[1], B = b[2]))
arg2 <- quote("where " * italic(y) * " = wood; " * italic(x) * " = dbh")
legend("topleft", legend = do.call(expression, c(arg1, arg2)), bty = "n")

John Maindonald.

On 11 Oct 2005, at 11:41 AM, Spencer Graves wrote:

Have you received a reply to this post? I couldn't find one, and I couldn't find a solution, even though one must exist. I can get the substitute to work in main but not legend:

B <- 2:3
eB <- substitute(y == a * x^b, list(a = B[1], b = B[2]))
plot(1:2, 1:2, main = eB)

You should be able to construct it using mtext, but I couldn't get the desired result using legend. Hope this helps. Spencer Graves

John Maindonald wrote:

expression() accepts multiple expressions as arguments, thus:

plot(1:2, 1:2)
legend("topleft", expression(y == a * x^b,
       "where " * paste("y==wood; ", "x==dbh")))

Is there a way to do this when values are to be substituted for a and b? i.e., the first element of the legend argument to legend() becomes, effectively:

substitute(y == a * x^b, list(a = B[1], b = B[2]))

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

-- Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA [EMAIL PROTECTED] www.pdf.com http://www.pdf.com Tel: 408-938-4420 Fax: 408-280-7915

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
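A stripped-down version of the same trick, with made-up coefficients in place of the DAAG example, may make the mechanism easier to see. (A sketch; the numbers in b are hypothetical. Note that do.call() hands the already-quoted calls to expression(), which leaves them unevaluated.)

b <- c(1.2, 0.85)   # hypothetical coefficients
arg1 <- bquote(italic(y) == .(A) * italic(x)^.(B), list(A = b[1], B = b[2]))
arg2 <- quote("where " * italic(y) * " = wood; " * italic(x) * " = dbh")
plot(1:2, 1:2)
legend("topleft", legend = do.call(expression, list(arg1, arg2)), bty = "n")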
Re: [R] R/S-Plus equivalent to Genstat predict
As an alternative to the effects package, try predict() with type="terms". JM

On 7 Oct 2005, at 8:00 PM, Peter Dunn wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Dunn Sent: Wednesday, October 05, 2005 9:06 PM To: R-help mailing list Subject: [R] R/S-Plus equivalent to Genstat predict: predictions over averages of covariates

Hi all, I'm doing some things with a colleague comparing different sorts of models. My colleague has fitted a number of glms in Genstat (which I have never used), while the glm I have been using is only available for R. He has a spreadsheet of fitted means from each of his models, obtained using the Genstat predict function. For example, suppose we fit a model of the type

glm.out <- glm(y ~ factor(F1) + factor(F2) + X1 + poly(X2,2) + poly(X3,2),
               family=...)

Then he produces a table like this (made up, but similar):

F1 (level 1)   12.2
F1 (level 2)   14.2
F1 (level 3)   15.3
F2 (level 1)   10.3
F2 (level 2)    9.1
X1 = 0         10.2
X1 = 0.5       10.4
X1 = 1         10.4
X1 = 1.5       10.5
X1 = 2         10.9
X1 = 2.5       11.9
X1 = 3         11.8
X2 = 0         12.0
X2 = 0.5       12.2
X2 = 1         12.5
X2 = 1.5       12.9
X2 = 2         13.0
X2 = 2.5       13.1
X2 = 3         13.5

Each of the numbers is a predicted mean. So when X1=0, on average we predict an outcome of 10.2. To obtain these figures in Genstat, he uses the Genstat predict function. When I asked for an explanation of how it was done (i.e., to make the predictions, what values of the other covariates were used), I was told: So, for a one-dimensional table of fitted means for any factor (or variate), all other variates are set to their average values; and the factor constants (including the first, at zero) are given a weighted average depending on their respective numbers of observations. So for quantitative variables (such as pH), one uses the mean pH in the data set when making the predictions. Reasonable and easy. But for categorical variables (like Month), he implies we use a weighted average of the fitted coefficients for all the months, depending on the proportion of times those factor levels appear in the data. (I hope I explained that OK...) Is there an equivalent way in R or S-Plus of doing this? I have to do it for a number of sites and species, so an automated way would be useful. I have tried searching to no avail (but may not be searching on the correct terms), and tried hard-coding something myself, as yet unsuccessfully: the poly terms and the use of the weighted averaging over the factor levels are proving a bit too much for my limited skills. Any assistance appreciated. (Any clarification of what I mean can be provided if I have not been clear.) Thanks, as always. P. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
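A sketch of what predict(..., type="terms") returns, on a made-up data frame (the data and model here are hypothetical, with names echoing the post):

set.seed(1)
dat <- data.frame(y  = rpois(100, 5),
                  F1 = factor(sample(1:3, 100, replace = TRUE)),
                  X1 = runif(100))
glm.out <- glm(y ~ F1 + X1, family = poisson, data = dat)
tt <- predict(glm.out, type = "terms")
head(tt)               # per-term contributions, centred over the data
attr(tt, "constant")   # the constant that re-centres the predictions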
[R] Multiple expressions, when using substitute()
expression() accepts multiple expressions as arguments, thus:

plot(1:2, 1:2)
legend("topleft", expression(y == a * x^b,
       "where " * paste("y==wood; ", "x==dbh")))

Is there a way to do this when values are to be substituted for a and b? i.e., the first element of the legend argument to legend() becomes, effectively:

substitute(y == a * x^b, list(a = B[1], b = B[2]))

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] R-help Digest, Vol 31, Issue 30
With lme4, use of mcmcsamp() can be insightful. (Douglas Bates drew my attention to this function in a private exchange of emails.) The distributions of the random effects are simulated on a log scale, where the distributions are much closer to symmetry than on the scale of the random effects themselves. As far as I can see, this is a straightforward use of MCMC to estimate model parameters; it is not clear to me whether the results from the lmer() fit are used. John Maindonald.

On 30 Sep 2005, at 8:00 PM, [EMAIL PROTECTED] wrote: From: Roel de Jong [EMAIL PROTECTED] Date: 29 September 2005 11:19:38 PM To: r-help@stat.math.ethz.ch Subject: [R] standard error of variances and covariances of the random effects with LME

Hello, how do I obtain standard errors of the variances and covariances of the random effects with LME, comparable to those of, for example, MLwiN? I know you shouldn't use them because the distribution of the estimator isn't symmetric blablabla, but I need a measure of the variance of those estimates for pooling my multiple imputation results. Regards, Roel.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
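A sketch of the kind of call involved, using the lme4 of that era (mcmcsamp() has since been removed from lme4, so this illustrates the 2005-vintage API rather than current practice; sleepstudy is lme4's example dataset, standing in for the poster's data):

library(lme4)
fm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
samp <- mcmcsamp(fm, n = 1000)   # MCMC sample from the posterior of the parameters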
Re: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA
For the record, it turns out that EXPNO ran from 1 to 20, i.e., it identified subject. Thus EXPNO/COND parsed into the two error terms (additional to residual) EXPNO and EXPNO:COND. This second error term accounts for all variation between levels of COND, so there is no COND sum of squares. (In SPSS the fixed effect COND may have taken precedence; I do not know for sure.) In R, if this was a complete randomized design, the term Error(EXPNO), or in the mock-up example I gave Error(subj), would be enough on its own. The R implementation can handle error terms akin to Error(REPNO/subj), but because there are redundant model matrix columns generated by the REPNO:subj term, it complains that the Error() model is singular. In general, terms of the form a/b should be used only if b is nested within a, i.e., REPNO/IndividualWithinBlock (where IndividualWithinBlock runs from 1 to 4), not REPNO/subj. (Either of these causes REPNO to be treated as a blocking factor.)

xy <- expand.grid(REPNO=letters[1:5], COND=letters[1:4],
                  TIME=factor(paste(1:2)))
xy$subj <- factor(paste(xy$REPNO, xy$COND, sep=":"))
## Below, subj becomes EXPNO
xy$COND <- factor(xy$COND)
xy$REPNO <- factor(xy$REPNO)
xy$y <- rnorm(40)

Plea to those who post such questions to the list:
~ Please include either a toy data set or, if the actual data set is small, lists of factor values. If you are happy to make the information public, give the result vector also (this is less important!). Or you can put the data and, where relevant, your code, on a web site. Be careful about the use of the word groups in an experimental design context; speak of treatment groups if that is the meaning, or blocks if that is what is intended. I suspect that confusion between these two contexts in which the word groups is wont to be used lay behind the use of the EXPNO/COND form of model formula. John Maindonald.

On 10 Sep 2005, at 8:00 PM, Larry A Sonna wrote: From: Larry A Sonna [EMAIL PROTECTED] Date: 10 September 2005 12:10:06 AM To: r-help@stat.math.ethz.ch Subject: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA

Dear R community, I am trying to resolve a discrepancy between the way SPSS and R handle 2-way, repeated measures ANOVA. An experiment was performed in which samples were drawn before and after treatment of four groups of subjects (control and disease states 1, 2 and 3). Each group contained five subjects. An experimental measurement was performed on each sample to yield a signal. The before and after treatment signals for each subject were treated as repeated measures. We desire to obtain P values for disease state (CONDITION), and the interaction between signal over time and disease state (CONDITION*TIME). Using SPSS, the following output was obtained:

           DF  SumSq (Type 3)  Mean Sq  F value  P=
COND        3           42861    14287    3.645  0.0355
TIME        1             473      473    0.175  0.681
COND*TIME   3             975      325    0.120  0.947
Error      16           43219     2701

By contrast, using the following R command:

summary(aov(SIGNAL ~ (COND+TIME+COND*TIME) + Error(EXPNO/COND), Type=III))

the output was as follows:

           Df Sum Sq Mean Sq F value  Pr(>F)
COND        3  26516    8839  3.2517 0.03651 *
TIME        1    473     473  0.1739 0.67986
COND:TIME   3    975     325  0.1195 0.94785
Residuals  28  76107    2718

I don't understand why the two results are discrepant. In particular, I'm not sure why R is yielding 28 DF for the residuals whereas SPSS only yields 16. Can anyone help?
John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA
There are 20 distinct individuals, right? expno breaks the 20 individuals into five groups of 4, right? Is this a blocking factor? If expno is treated as a blocking factor, the following is what you get:

xy <- expand.grid(expno=letters[1:5], cond=letters[1:4],
                  time=factor(paste(1:2)))
xy$subj <- factor(paste(xy$expno, xy$cond, sep=":"))
xy$cond <- factor(xy$cond)
xy$expno <- factor(xy$expno)
xy$y <- rnorm(40)
summary(aov(y ~ cond*time + Error(expno/cond), data=xy))

Error: expno
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  4   3.59    0.90

Error: expno:cond
          Df Sum Sq Mean Sq F value Pr(>F)
cond       3   1.06    0.35    0.36   0.78
Residuals 12  11.86    0.99

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
time       1   2.27    2.27    1.38   0.26
cond:time  3   3.27    1.09    0.67   0.59
Residuals 16  26.19    1.64

If on the other hand this is analyzed as for a complete randomized design, the following is the output:

summary(aov(y ~ cond*time + Error(subj), data=xy))

Error: subj
          Df Sum Sq Mean Sq F value Pr(>F)
cond       3   1.06    0.35    0.37   0.78
Residuals 16  15.46    0.97

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
time       1   2.27    2.27    1.38   0.26
cond:time  3   3.27    1.09    0.67   0.59
Residuals 16  26.19    1.64

On 10 Sep 2005, at 8:00 PM, Larry A Sonna wrote: From: Larry A Sonna [EMAIL PROTECTED] Date: 10 September 2005 12:10:06 AM To: r-help@stat.math.ethz.ch Subject: [R] Discrepancy between R and SPSS in 2-way, repeated measures ANOVA

Dear R community, I am trying to resolve a discrepancy between the way SPSS and R handle 2-way, repeated measures ANOVA. An experiment was performed in which samples were drawn before and after treatment of four groups of subjects (control and disease states 1, 2 and 3). Each group contained five subjects. An experimental measurement was performed on each sample to yield a signal. The before and after treatment signals for each subject were treated as repeated measures. We desire to obtain P values for disease state (CONDITION), and the interaction between signal over time and disease state (CONDITION*TIME). Using SPSS, the following output was obtained:

           DF  SumSq (Type 3)  Mean Sq  F value  P=
COND        3           42861    14287    3.645  0.0355
TIME        1             473      473    0.175  0.681
COND*TIME   3             975      325    0.120  0.947
Error      16           43219     2701

By contrast, using the following R command:

summary(aov(SIGNAL ~ (COND+TIME+COND*TIME) + Error(EXPNO/COND), Type=III))

the output was as follows:

           Df Sum Sq Mean Sq F value  Pr(>F)
COND        3  26516    8839  3.2517 0.03651 *
TIME        1    473     473  0.1739 0.67986
COND:TIME   3    975     325  0.1195 0.94785
Residuals  28  76107    2718

I don't understand why the two results are discrepant. In particular, I'm not sure why R is yielding 28 DF for the residuals whereas SPSS only yields 16. Can anyone help?

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] colors and palettes and things...
The DAAG package has the following:

show.colors(type=c("singles", "shades", "grayshades"), order.cols=TRUE)

I am sure there are better ways to do the ordering than my ad hoc approach, though.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 24 May 2005, at 11:04 AM, [EMAIL PROTECTED] wrote: From: Jeff D. Hamann [EMAIL PROTECTED] Date: 23 May 2005 4:35:27 PM To: r-help@stat.math.ethz.ch Subject: [R] colors and palettes and things... Reply-To: [EMAIL PROTECTED]

After trying to find if there was a color picker in the FAQs and the help, I thought I would send a post here. I was overwhelmed with all the wonderful color choices R has predefined (discovered after typing in colors()) but can't figure out what they all (by name) look like. Is there a color picker or some other method to display all those colors next to the name? I think I can put together palettes, but another question I have then regards the building of palettes (a list of variable length I can select or create myself, other than the ones defined by palette()) so I can pass these colors into functions, instead of having to predefine a bunch of colors myself or use the predefined colors like terrain.colors(n). Are there groups of colors in colors() that I can group together to make some nice palettes for drawing barplots, etc? Thanks, Jeff. -- Jeff D. Hamann Forest Informatics, Inc. PO Box 1421 Corvallis, Oregon 97339-1421 phone 541-754-1428 fax 541-752-0288 [EMAIL PROTECTED] www.forestinformatics.com __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
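For anyone without DAAG to hand, a bare-bones chart of the named colors can be had with base graphics alone. A sketch only: colors are drawn in the order colors() returns them, with no attempt at the ordering discussed above, and no names shown:

cols <- colors()
n <- length(cols)
nr <- 25
nc <- ceiling(n / nr)
plot(c(0, nc), c(0, nr), type = "n", axes = FALSE,
     xlab = "", ylab = "", main = "colors(), unordered")
for (i in seq_along(cols)) {
  x <- (i - 1) %/% nr   # column in the chart
  y <- (i - 1) %% nr    # row in the chart
  rect(x, y, x + 0.9, y + 0.9, col = cols[i], border = NA)
}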
[R] Garbled plot label
Marc Schwartz (MSchwartz at MedAnalytics.com) wrote: Here is another option. One issue is that, given the way in which plot.lm() is coded, some of the axis labels are hard coded when passed to the four underlying plot functions, and as far as I can tell there is no way to use a 'par' and just blank out the x axis labels only. Thus, both x and y axis labels need to be blanked and then separately created using title(). Thus, here is a plot.lm2() function. It's a bit kludgy, but it seems to work, though other eyes should look at it for any errors. What it effectively does is to do each of the four plots in plot.lm() individually without labels (ann = FALSE) and then adds them, generally based upon the way it is done in plot.lm(). The x axis labels are paste()'d to the wrapped model expression to create a multi-line sub-title for each plot. The wrap.len argument is the 'width' argument for strwrap, indicating the target line-wrapping length. Note that if you get to around 3 lines, you will likely need to modify the margins in the plot for side 1 to provide more room.

I cannot see how to set this up so that it works in every situation. The only clean and simple way that I can see to handle the problem is to set a default that tests whether the formula is broken into multiple text elements and, if it is, omits it. Users can then use their own imaginative skills, and such suggestions as have been made on r-help, to construct whatever form of labeling best suits their case, their imaginative skills and their coding skills. The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ has image files plot.lm.RData and plot6.lm.RData that are proposals for a revised version of plot.lm(). The changes made so far do not deal with the long-formula issue, but if nothing better turns up my proposal will be to proceed as I have just indicated.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
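The wrapping device that Marc describes can be tried in isolation (a sketch; the formula and the width of 30 are arbitrary choices, not taken from plot.lm2()):

f <- log(wood) ~ log(dbh) + species + site + aspect + slope
strwrap(paste(deparse(f), collapse = " "), width = 30)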
[R] Odd diagnostic plots in mixed-effects models
Is the fixed effect estimated at the innermost level? If not, plots of residuals at that level are surely of limited interest. qq-plots, to be relevant, surely need to assess normality of effects (rather than residuals) at the level that matters for the intended inferences. If the fixed effect is estimated at the level of the random effect, then of course there are just 12 effects that should appear in any qq or suchlike plot.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 19 Apr 2005, at 8:03 PM, [EMAIL PROTECTED] wrote: From: Andrew Robinson [EMAIL PROTECTED] Date: 19 April 2005 12:41:24 PM To: r-help@stat.math.ethz.ch Subject: [R] Odd diagnostic plots in mixed-effects models

Dear R community, In the excellent nlme package the default diagnostic plot graphs the innermost residuals against innermost fitted values. I recently fit a mixed-effects model in which there was a very clear positive linear trend in this plot. I inferred that this trend occurred because my fixed effect was a two-level factor, and my random effect was a 12-level factor. The negative residuals were associated with negative random effects (because of shrinkage, I assume), and the positive with positive. The fixed effects explained little variation. Therefore plotting the innermost residuals against the innermost fitted values had the negative residuals to the left and the positive residuals to the right, occasioning a trend. My questions are: is it (as I suspect) harmless, or does it suggest that the model is lacking? And is this effect likely to compromise the interpretation of any of the other standard diagnostic plots (e.g. qqnorm)? Thanks much for any thoughts, Andrew -- Andrew Robinson Ph: 208 885 7115 Department of Forest Resources Fa: 208 885 6226 University of Idaho E : [EMAIL PROTECTED] PO Box 441133 W : http://www.uidaho.edu/~andrewr Moscow ID 83843 Or: http://www.biometrics.uidaho.edu __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
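One way to look directly at the estimated effects at the level that matters is to extract them with ranef() and qq-plot those, rather than the innermost residuals. A sketch with nlme's Orthodont data, which is not from this thread but is a convenient grouped dataset:

library(nlme)
fm <- lme(distance ~ age, random = ~1 | Subject, data = Orthodont)
re <- ranef(fm)[, 1]    # estimated random intercepts, one per subject
qqnorm(re); qqline(re)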
[R] Base and lattice graphics on the same graphics page
Although base graphics does not mix with lattice in the one graph, I've found that print.trellis(position=...) and the use of par(fig=...) to put regular and trellis graphics on the one graphics page works like a treat, at least in version 2.0.1 of R. [Base graphics functions that are themselves inconsistent with par(fig=...) are obviously disallowed.] I am wondering whether there are caveats of which I and others should be aware, or whether there is a risk that the ongoing development of R's graphics abilities will render such a cohabitation unworkably fractious. Example:

gph <- bwplot(voice.part ~ height, data=singer)
print(gph, position=c(0, 0.5, 1, 1))  # x0, y0, x1, y1
par(fig=c(0, 1, 0, 0.5), new=TRUE)    # x0, x1, y0, y1
boxplot(height ~ voice.part, data=singer, horiz=TRUE)

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 24, Issue 28
You've omitted a comma. races2000 is a data frame, which for purposes of extracting rows behaves like a 2-dimensional object. The following works fine:

hills2000 <- races2000[races2000$type == "hill", ]

Additionally, you might like to ponder:

type <- races2000[names(races2000) == "type"]
type[1:4]
Error in "[.data.frame"(type, 1:4) : undefined columns selected
length(type)    # type is a data frame with 1 column
[1] 1
vtype <- unlist(type)   # Extract the vector that is the one
                        # data frame (list) element
vtype[1:4]              # Try also length(vtype)
type.type1 type.type2 type.type3 type.type4
    uphill      other      other      relay

Your syntax (without the comma) does give a result, providing that the dimensions match (the condition must have the same number of elements as races2000 has columns), but it is probably not the result that you want! See further pp. 320-321 of the DAAG book.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 28 Feb 2005, at 10:07 PM, [EMAIL PROTECTED] wrote: From: Clint Harshaw [EMAIL PROTECTED] Date: 28 February 2005 1:08:36 AM To: r-help@stat.math.ethz.ch Subject: [R] subsetting data set dimension problem (See DAAG book, p. 173, ex. 3)

I'm a new user of R, and I'm following the DAAG text. I want to create a subset of the races2000 data frame, but get errors because of a mismatch of values in some columns:

library(DAAG)
attach(races2000)
hills2000 <- races2000[races2000$type == 'hill']
Error in as.matrix.data.frame(x) : dim<- : dims [product 770] do not match the length of object [771]

However, if I follow the solution given, and remove redundant columns 1 through 6 and column 11 (which I won't need, since I know they are going to have the same value), I don't get the error:

hills2000 <- races2000[races2000$type == 'hill', -c(1:6,11)]
hills2000
              dist climb  time     timef
Tiso Carnethy 6.00  2500 0.782 0.9191667
[...]
Cornalees     5.50   800 0.618        NA
[...]

What is causing the error with my original subsetting? I speculated it was related to the NA values, but there is an NA in the resulting hills2000, corresponding to the Cornalees hill race. Thanks, Clint -- Clint Harshaw, PhD Department of Mathematics Presbyterian College Clinton SC 29325 EMail: [EMAIL PROTECTED] Phone: 864.833.8995 Fax: 864.938.3769 Office: Harrington-Peachtree Rm 412

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
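The distinction in miniature, with a toy data frame (hypothetical data, not from the DAAG package):

df <- data.frame(a = 1:3, b = letters[1:3], cc = 4:6)
df[c(TRUE, FALSE, TRUE), ]   # rows 1 and 3: the comma makes this row selection
df[c(TRUE, FALSE, TRUE)]     # columns a and cc: without the comma, the logical
                             # vector indexes the columns (list elements)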
[R] read.table
In addition to other suggestions made, note also count.fields().

cat("10 9 17",    # First of 7 lines
    "11 13 1 6", "9 14 16",
    "12 15 14", "8 15 15", "9 13 12", "7 14 18",
    file="oneBadRow.txt", sep="\n")
nfields <- count.fields("oneBadRow.txt")
nfields
[1] 3 4 3 3 3 3 3
table(nfields)   ## Use with many records
nfields
3 4
6 1
tab <- table(nfields)
(1:length(nfields))[nfields == 4]
[1] 2
readLines("oneBadRow.txt", n=-1)[2]
[1] "11 13 1 6"

Note the various option settings for count.fields().

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 26 Feb 2005, at 10:03 PM, [EMAIL PROTECTED] wrote: From: Sean Davis [EMAIL PROTECTED] Date: 26 February 2005 7:11:48 AM To: r-help r-help@stat.math.ethz.ch Subject: [R] read.table

I have a commonly recurring problem and wondered if folks would share tips. I routinely get tab-delimited text files that I need to read in. In very many cases, I get:

a <- read.table('junk.txt.txt', header=T, skip=10, sep="\t")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
        line 67 did not have 88 elements

I am typically able to go through the file and find a single quote or something like that causing the problem, but with a recent set of files, I haven't been able to find such an issue. What can I do to get around this problem? I can use perl, also. Thanks, Sean __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 24, Issue 22
You need to give the model formula that gave your output. There are two sources of variation (at least), within and between locations; though it looks as though your analysis may have tried to account for this (but if so, the terms are not laid out in a way that makes for ready interpretation). The design is such (two locations) that you do not have much of a check that effects are consistent over locations. You need to check whether results really are similar for all cultivars and for all herbicides, so that it is legitimate to pool as happens in the overall analysis. If a herbicide:cultivar combination has little effect, the variability may be large, while if it has a dramatic effect (kills everything!), there may be no variability to speak of. John Maindonald.

On 22 Feb 2005, at 10:06 PM, [EMAIL PROTECTED] wrote: To: 'Bob Wheeler' [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: RE: [R] power.anova.test for interaction effects Reply-To: [EMAIL PROTECTED]

It's a rather complex model. A 37*4 factorial (37 cultivars [var]; 4 herbicide treatments [trt]) with three replications [rep] was carried out at two locations [loc], with different randomizations within each rep at each location.

Source         DF  Error Term        MS
Loc             1  Trt*rep(loc)   12314
Rep(loc)        4  Trt*rep(loc)  1230.5
Trt             3  Trt*rep(loc)   64.72
Trt*loc         3  Trt*rep(loc)   33.42
Trt*rep(loc)   12  Residual       76.78
Var            36  Var*trt*loc    93.91
Var*trt       108  Var*trt*loc    12.06
Var*trt*loc   144  Residual       43.09
Residual      575  NA             21.23

-----Original Message----- From: Bob Wheeler [mailto:[EMAIL PROTECTED] Sent: Monday, February 21, 2005 4:33 PM To: [EMAIL PROTECTED] Cc: r-help@stat.math.ethz.ch Subject: Re: [R] power.anova.test for interaction effects

Your F value is so low as to make me suspect your model. Where did the 144 denominator degrees of freedom come from?

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] is.matrix(), as.matrix, as(,matrix)
Under help(matrix) it is written:

'is.matrix' tests if its argument is a (strict) matrix. It is generic: you can write methods to handle specific classes of objects, see InternalMethods.

Further down, under Details, the meaning of strict is explained more explicitly:

'is.matrix' returns 'TRUE' if 'x' is a matrix (i.e., it is _not_ a 'data.frame' and has a 'dim' attribute of length 2) and 'FALSE' otherwise

(i.e., strict means matrix in a broad sense).

# The following is consistent with this:
tab <- with(ToothGrowth, table(supp, dose))
is.matrix(tab)
[1] TRUE
class(as.matrix(tab))
[1] "table"

However the function as() has an argument strict that has the connotation restricted sense:

strict: A logical flag. If 'TRUE', the returned object must be strictly from the target class (unless that class is a virtual class, in which case the object will be from the closest actual class (often the original object, if that class extends the virtual class directly).

# The following is consistent with this:
class(as(tab, "matrix", strict=TRUE))   # TRUE is the default
[1] "matrix"
class(as(tab, "matrix", strict=FALSE))
[1] "table"

# Note also:
class(data.matrix(tab))
[1] "table"

At the very least, the word (strict) should be removed from "'is.matrix' tests if its argument is a (strict) matrix." and replaced by something like "(in the broad sense defined below)". It would make for greater consistency and convenience if all of as.matrix(), is.matrix() and data.matrix() had arguments strict, where strict=FALSE would preserve the present behaviour. Additionally, it would be useful to mention, under the documentation for matrix() and data.matrix(), that as.matrix(tab) is equivalent to as(tab, "matrix", strict=FALSE) (is that strictly correct?). I often want to use xtable() with tables. There is no method defined for the class table. After a bit of rummaging, I found that I can use xtable(as(tab, "matrix")).

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] classification for huge datasets: SVM yields memory troubles
While it is true that the large number of variables relative to the number of observations restricts what can be inferred, the situation is not as hopeless as Bert seems to suggest. If it were, attempts at the analysis of expression array data would be a waste of time. Methods developed for that general area may well be relevant to other data where the number of variables is similarly far larger than the number of observations. See Ambroise, C. and McLachlan, G.J. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99: 6562--6566. This discusses some of the literature on the use of SVMs. The selection bias that these authors discuss also affects plots, even principal components and other ordination-based plots, where features have been selected on the basis of their ability to separate into known groups. I have draft versions of code that address this selection bias as it affects the plotting of graphs, which (along with a paper that has been submitted for inclusion in a conference proceedings) I am happy to make available to anyone who wants to experiment. Another good place to look, as a starting point, may be Gordon Smyth's LIMMA User's Guide. This can be a bit hard to find. With limma installed, type help.start(). After some time a browser window should open. Click on Packages | limma | Overview | LIMMA User's Guide (pdf).

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 14 Dec 2004, at 10:09 PM, [EMAIL PROTECTED] wrote: From: Berton Gunter [EMAIL PROTECTED] Date: 14 December 2004 9:23:08 AM To: 'Andreas' [EMAIL PROTECTED], [EMAIL PROTECTED] Cc: Subject: RE: [R] classification for huge datasets: SVM yields memory troubles

"I have a matrix with 30 observations and roughly 3 variables, ..." snipped

Comment: This is ** not ** a huge data set -- it is a tiny one with a large number of covariates. The difference is: If it were truly huge, SVM and/or LDA or ... might actually be able to produce useful results. With so few data and so many variables, it is hard to see how any approach that one uses is not simply a fancy random number generator.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Call to trellis.focus(); thenpanel.superpose()
The following works fine with the x11 device, though it may well be that an initial plot is overwritten. With a pdf or postscript device, I get two plots, the first of which still has the red border from having the focus, while the second is the plot that I want.

library(lattice); library(grid)
plt <- xyplot(uptake ~ conc, groups=Plant, data=CO2)
print(plt)
trellis.focus("panel", row=1, column=1)
arglist <- trellis.panelArgs()
arglist$type <- "l"
do.call("panel.superpose", args=arglist)
trellis.unfocus()

Should I be able to use panel.superpose() in this way? The new abilities provided by trellis.focus() etc. add greatly to the flexibility of what can be done with lattice plots. The grid-lattice combination is a great piece of software.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Location of grobs etc on lattice output
I'm puzzled about side effects of trellis.unfocus(): The following runs without problem, though grid.text() does not seem to do anything. (I'd thought that I had it working at one point.)

library(DAAG); library(lattice); library(grid)
cuckoos.strip <- stripplot(species ~ length, xlab="", data=cuckoos)
cuckoos.bw <- bwplot(species ~ length, xlab="Length of egg (mm)",
                     data=cuckoos)
vp0 <- viewport(layout=grid.layout(2, 1))
pushViewport(vp0)
vp1 <- viewport(layout.pos.row=1)
vp2 <- viewport(layout.pos.row=2)
pushViewport(vp1)
print(cuckoos.strip, newpage=FALSE)
# trellis.focus("panel", row=1, column=1, clip.off=TRUE)
grid.text("A", x=unit(0, "native"), y=unit(1.05, "native"),
          gp=gpar(fontsize=9))
# trellis.unfocus()   ## remove the following upViewport()
upViewport()
upViewport()
pushViewport(vp2)
print(cuckoos.bw, newpage=FALSE)
trellis.focus("panel", row=1, column=1, clip.off=TRUE)
grid.text("B", x=unit(0, "native"), y=unit(1.05, "native"),
          gp=gpar(fontsize=9))
trellis.unfocus()

If I remove the #'s, and remove the upViewport() that follows the second #, I seem to lose the current tree, as though the newpage=FALSE for the next print() is ignored. Should I be able to do something like this? Clearly I do not understand what happens when trellis.focus() is invoked. This seems an area where an effective GUI, with a graphical display of the viewport tree, could be very helpful.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 21 Nov 2004, at 3:41 PM, Deepayan Sarkar wrote:

On Saturday 20 November 2004 19:41, John Maindonald wrote: Is there any way, after use of print.trellis(), to obtain the co-ordinates of the plot region, e.g., in what are then the native co-ordinates?

Have you read help(trellis.focus)? This is new in 2.0.0 and the recommended API for interacting with lattice plots (you can of course use grid tools directly, but details are more likely to change at that level). It hasn't had much testing, so I would appreciate reports of things that should be doable easily but aren't. e.g.

library(DAAG)
library(lattice); library(grid)
data(cuckoos)
pushViewport(viewport(layout=grid.layout(2, 1)))
pushViewport(viewport(layout.pos.row=1))
cuckoos.strip <- stripplot(species ~ length, data=cuckoos)
print(cuckoos.strip, newpage=FALSE)
grid.text("A", x=unit(0.18, "native"), y=unit(0.925, "native"))
# This works, but is fiddly, and needs rejigging if width
# or fontsize are changed.
popViewport(1)

An alternative would of course be to access the co-ordinate system used by the lattice function for locating the panels, or for locating labelling. As in the example above, I have been using grid.text() to position text outside the plot region, but closer to the top axis than the legend parameter to the lattice function will allow.

trellis.focus("panel", row=1, column=1, clip.off=TRUE) will put you in the plot region (panel), but will switch off clipping so you can write text outside. You can also now control the amount of space between the axis and legend; see

str(trellis.par.get("layout.heights"))

Deepayan __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Location of grobs etc on lattice output
Is there any way, after use of print.trellis(), to obtain the co-ordinates of the plot region, e.g., in what are then the native co-ordinates? e.g.

library(DAAG)
library(lattice); library(grid)
data(cuckoos)
pushViewport(viewport(layout=grid.layout(2, 1)))
pushViewport(viewport(layout.pos.row=1))
cuckoos.strip <- stripplot(species ~ length, data=cuckoos)
print(cuckoos.strip, newpage=FALSE)
grid.text("A", x=unit(0.18, "native"), y=unit(0.925, "native"))
# This works, but is fiddly, and needs rejigging if width
# or fontsize are changed.
popViewport(1)

An alternative would of course be to access the co-ordinate system used by the lattice function for locating the panels, or for locating labelling. As in the example above, I have been using grid.text() to position text outside the plot region, but closer to the top axis than the legend parameter to the lattice function will allow.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] The hidden costs of GPL software?
The author of the article says nothing about the large number of hours and weeks that he surely spent learning S-PLUS! There should be attention to the costs that arise from a wrong or inappropriate analysis, perhaps because the software that is in use makes it difficult to do anything better, perhaps because of statistical skill limitations, often with these two factors working together. Analyses that misrepresent the science, or designs and analyses that conspire together to this end, have serious and costly implications for research. I've refereed several papers recently, in broadly ecological fields of endeavour, with seemingly quite reasonable data, where the mix of author skill and abilities of the package was clearly not up to the task in hand. Relative to getting on top of the statistical issues (for which they will probably end up getting, as they need to, statistical help), the GUI/no-GUI issue will be a minor consideration, and hours or weeks spent learning R will be at most a modest consideration.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 17 Nov 2004, at 10:27 PM, [EMAIL PROTECTED] wrote: From: Philippe Grosjean [EMAIL PROTECTED] Date: 17 November 2004 8:53:28 PM To: [EMAIL PROTECTED], [EMAIL PROTECTED] Cc: Subject:

Hello, In the latest 'Scientific Computing World' magazine (issue 78, p. 22), there is a review of free statistical software by Felix Grant ("doesn't have to pay good money to obtain good statistics software"). As far as I know, this is the first time that R is even mentioned in this magazine, given that it usually discusses commercial products. In this article, the analysis of R is interesting. It is admitted that R is a great software with lots of potential, but: "All in all, R was a good lesson in the price that may have to be paid for free software: I spent many hours relearning some quite basic things taken for granted in the commercial package." Those basic things are related to data import, obtaining basic plots, etc., with the claim that a more intuitive GUI is missing that would smooth the learning curve a little. There are several R GUI projects ongoing, but these are progressing very slowly. The main reason is, I believe, that a relatively low number of programmers working on R are interested in this field. Most people wanting such a GUI are basic users that do not (cannot) contribute... And if they eventually become more knowledgeable, they tend to have other interests. So, is this analysis correct: are there hidden costs for free software like R in the time required to learn it? At least currently, for the people I know (biologists, ecologists, oceanographers, ...), this is perfectly true. This is even an insurmountable barrier for many of them I know, and they have given up (they come back to Statistica, Systat, or S-PLUS, using exclusively functions they can reach through menus/dialog boxes). Of course, the solution is to have a decent GUI for R, but this is a lot of work, and I wonder if the intrinsic mechanism of the GPL is not working against such a development (leading to a very small pool of programmers actively involved in the elaboration of such a GUI, in comparison to the very large pool of competent developers working on R itself).
Do not misunderstand me: I am not giving up on my GUI project. I am just wondering if there is a general, ineluctable mechanism that leads to the current R / R GUI situation as it stands... and consequently to a general rule that there are indeed, most of the time, hidden costs in free software, due to the larger time required to learn it. I am sure there are counter-examples; however, my feeling is that, for Linux, Apache, etc., the GUI (if there is one) often lags well behind the potential of the software, leading to a steep learning curve for anyone wanting to use all of its features. I would be interested in your impressions and ideas on this topic. Best regards, Philippe Grosjean __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] highly biased PCA data?
I'd suggest you start by using lda() or qda() from MASS, benefits being that (a) if the frequencies in the sample do not reflect the frequencies in the target population, you can set 'prior' to mirror the target frequencies. The issue is, perhaps: is your odd person odd in a population of 1000 dog owners : 100 cat owners : 10 fish owners, or odd, e.g., in a 1000:1000:50 population? You can also vary the prior to see what the effect is. If however you set a large prior probability for a group that is poorly represented, results will be 'noisy'. Note the use of 'classwt' for the prior probabilities in randomForest(). (b) You can plot second versus first discriminant function scores, to get a direct graphical representation of results. Other discrimination techniques may have to use an ordination technique, or even lda() or qda() on a 2-dimensional representation of results, in order to get a scatterplot. [cf MDSplot() for randomForest()]

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 5 Nov 2004, at 10:18 PM, [EMAIL PROTECTED] wrote: From: Berton Gunter [EMAIL PROTECTED] Date: 5 November 2004 5:08:38 AM To: 'Dan Bolser' [EMAIL PROTECTED], 'R-help' [EMAIL PROTECTED] Cc: Subject: RE: [R] highly biased PCA data?

Dan: 1) There is no guarantee that PCA will show separate groups, of course, as that is not its purpose, although it is frequently a side effect. 2) If you were to use a classification method of some sort (discriminant analysis, neural nets, SVMs, model-based classification, ...), my understanding is that yes, indeed, severely unbalanced group membership would affect results. A guess is that Bayesian or other methods that could explicitly model the prior membership probabilities would do better. To make it clear why, suppose that there was a 99.9% preference for dog and .05% each for the others. Then your datasets would have almost no information on how covariates could distinguish the classes, and the best classifier would be to call everything a dog no matter what values the covariates had. I presume experts will have more and better to say about this. -- Bert Gunter

[mailto:[EMAIL PROTECTED] On Behalf Of Dan Bolser Sent: Thursday, November 04, 2004 9:41 AM To: R mailing list Subject: [R] highly biased PCA data?

Hello, supposing that I have two or three clear categories for my data, let's say pet preference across fish, cat, dog. Let's say most people rate their preference as being mostly one of the categories. I want to do PCA on the data to see three 'groups' of people: one group for fish, one for cat and one for dog. I would like to see the odd person who likes both or all three in the (appropriate) middle of the other main groups. Will my data be affected by the fact that I have interviewed 1000 dog owners, 100 cat owners and 10 fish owners? (assuming that each scale of preference has an equal range). Cheers, Dan. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
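A sketch of point (a), using lda() from MASS with an explicit prior (iris stands in here for the pet-preference data; the prior values are arbitrary):

library(MASS)
fit <- lda(Species ~ ., data = iris, prior = c(0.5, 0.25, 0.25))
plot(fit)   # scores on the first two discriminant functions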
[R] Using R for Data Analysis and Graphics ...
A revised version of this document, now designed to reflect version 2.0.0 of R, is available from CRAN sites, under Documentation | Contributed. Data sets are available from my web page directory: http://wwwmaths.anu.edu.au/~johnm/r/dsets Preferably, retrieve the image file usingR.Rdata. The output has, mostly, not been revised; thus it will in some cases reflect an earlier version of R. This is a task for some later time.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Lattice .ps graphic is rotated in LaTeX slides
On 2 Oct 2004, at 8:04 PM, [EMAIL PROTECTED] wrote:

On Fri, 2004-10-01 at 10:58, Peter Dalgaard wrote: Michael Friendly [EMAIL PROTECTED] writes:

! Package graphics Error: Division by 0.

What am I doing wrong, or how could I do it differently so it would work?

You might try \usepackage{graphicx} instead. I seem to recall (vaguely) getting better results with that sometimes.

That should be part of the preamble for using 'seminar', if it is set up properly. There is a decent tutorial for using seminar at: http://astronomy.sussex.ac.uk/~eddie/soft/tutorial.html There is also a great reference for including graphics in LaTeX: www.ctan.org/tex-archive/info/epslatex.pdf FWIW, though I have been using seminar for such presentations, I have recently been looking at the Beamer package: http://latex-beamer.sourceforge.net/ and of course there is also the Prosper package: http://prosper.sourceforge.net/ The one advantage of the Beamer package, for those that require it, is that it supports pdflatex, which the others do not. Though it can be used with dvips/latex + ps2pdf, where needed. HTH, Marc

Note also the package pdfscreen, for use with pdflatex: www.river-valley.com/download2.shtml This pretty much uses regular LaTeX, with the page dimensions changed and the font attributes redefined to make them larger than usual inside the slide environment. Be sure to load the packages xspace and colortbl as well as pdfscreen.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Re: Thanks Frank, setting graph parameters, and why social scientists don't use R
There are answers that could and should be applied in specific situations. At least in academia and in substantial research teams, statisticians ought to have a prominent part in many of the research teams; senior statisticians should have a prominent role in deciding the teams to which this applies. Why should it be OK to combine high levels of chemical expertise with truly appalling statistical misunderstandings, to the extent that the supposed chemical insights are not what they appear to be? There should be a major focus on training application-area students to understand important ideas, to recognize when they are out of their depth, and to work with statisticians. There should be much more use of statisticians in the refereeing of published papers. Editors need to seek advice from experienced statisticians (some do) on what sorts of papers are candidates for statistical refereeing. Publication in an archive of the data that have been used for a paper could be a huge help, so that others can check whether the data really do support the conclusions. Even better, as Robert Gentleman has argued, would/will be papers that can be processed through Sweave or its equivalent. Really enlightened people (in the statistical sense) in the applied communities will latch onto R, as some are doing, because the limitations inherent in much other software so often lead to crippled and/or misleading analyses. Increasingly, we can hope that it will become difficult for statistics in various applied-area communities to proceed on its merry way, ignorant of or ignoring most of what has happened in the mainstream statistical community in the past 20 years. The statistical community needs to be a lot more aggressive in demanding adequate standards of data analysis in applied areas, at the same time suggesting ways in which it can work with application-area people to improve standards. It is also fair to comment that the situation is very uneven. There are some areas where the standards are pretty reasonable, at least for the types of problems that typically come up in those areas. John Maindonald.

John Maindonald email: [EMAIL PROTECTED] phone : +61 2 (6125)3473 fax : +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

On 18 Aug 2004, Bert Gunter wrote: So we see fairly frequently indications of misunderstanding and confusion in using R. But the problem isn't R -- it's that users don't know enough statistics. . . . . I wish I could say I had an answer for this, but I don't have a clue. I do not think it's fair to expect a mechanical engineer or psychologist or biologist to have the numerous math and statistical courses and the experience in their training that would provide the base they need. For one thing, they don't have the time in their studies for this; for another, they may not have the background or interest -- they are, after all, mechanical engineers or biologists, not statisticians. Unfortunately, they could do their jobs as engineers and scientists a lot better if they did know more statistics. To me, it's a fundamental conundrum, and no one is to blame. It's just the reality, but it is the source of all kinds of frustrations on both sides of the statistical divide, which both you and Roger expressed in your own ways. . . . . __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
[R] Re: R-help Digest, Vol 18, Issue 12
The message for aov1 was "Estimated effects may be unbalanced". The effects are not unbalanced. The design is 'orthogonal'. The problem is that there are not enough degrees of freedom to estimate all those error terms. If you change the model to:

aov1 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/(fact1+fact2+fact3)), data=myData)

or to

aov2 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/((fact1+fact2+fact3)^2)), data=myData)

all is well. This last model (aov2) seems to me to have an excessive number of error terms. The lme model

lme(RT ~ fact1*fact2*fact3, random = ~1 | sub, data=myData)

is equivalent to

aov0 <- aov(RT ~ fact1*fact2*fact3 + Error(sub), data=myData)

It can be verified that the estimated variance components can be used to reproduce the mean squares in the anova table.

A conservative approach is to estimate, e.g., contrasts involving fact1 for each subject separately, and then compare these against SE estimates that have 4 df (5 subjects - 1); a code sketch follows at the end of this message. This is the ultimate check -- are claimed effects consistent across the 5 subjects? The standard errors should, though, probably be based on some variety of averaged variance. I do not know how to set up the equivalents of aov1 and aov2 (where the errors are indeed crossed) in lme4.

John Maindonald.

On 12 Aug 2004, at 8:03 PM, [EMAIL PROTECTED] wrote:

I will follow the suggestion of John Maindonald and present the problem by example with some data. I also follow the advice to use mean scores, somewhat reluctantly though. I know it is common practice in psychology, but wouldn't it be more elegant if one could use all the data points in an analysis? The data for 5 subjects (myData) are provided at the bottom of this message. It is a crossed within-subject design with three factors and reaction time (RT) as the dependent variable. An initial repeated-measures model would be:

aov1 <- aov(RT ~ fact1*fact2*fact3 + Error(sub/(fact1*fact2*fact3)), data=myData)

aov complains that the effects involving fact3 are unbalanced:

> aov1
Stratum 4: sub:fact3
Terms:
                      fact3   Residuals
Sum of Squares  4.81639e-07 5.08419e-08
Deg. of Freedom           2           8
Residual standard error: 7.971972e-05
6 out of 8 effects not estimable
Estimated effects may be unbalanced

Presumably this is because fact3 has three levels and the other ones only two, making this a non-orthogonal design. I then fit an equivalent mixed-effects model:

lme1 <- lme(RT ~ fact1*fact2*fact3, data=meanData2, random = ~1 | sub)

Subsequently I test if my factors had any effect on RT:

> anova(lme1)
                  numDF denDF   F-value p-value
(Intercept)           1    44 105294024  <.0001
fact1                 1    44        22  <.0001
fact2                 1    44         7  0.0090
fact3                 2    44        19  <.0001
fact1:fact2           1    44         9  0.0047
fact1:fact3           2    44         1  0.4436
fact2:fact3           2    44         1  0.2458
fact1:fact2:fact3     2    44         0  0.6660

Firstly, why are the F-values in the output whole numbers? The effects are estimated over the whole range of the dataset, and this is so because all three factors are nested under subjects, on the same level. This, however, seems to inflate the F-values compared to the anova(aov1) output, e.g.

> anova(aov1)
Error: sub:fact2
          Df     Sum Sq    Mean Sq F value Pr(>F)
fact2      1 9.2289e-08 9.2289e-08  2.2524 0.2078
Residuals  4 1.6390e-07 4.0974e-08

I realize that the (probably faulty) aov model may not be directly compared to the lme model, but my concern is whether the lme estimation of the effects is right, and if so, how can a naive skeptic be convinced of this? The suggestion to use simulate.lme() to find this out seems good, but I can't seem to get it working (from [R] lme: simulate.lme in R it seems it may never work in R).
I have also followed the suggestion to fit the exact same model with lme4. However, the format of the anova output does not give me the estimation in the way nlme does. More importantly, the degrees of freedom in the denominator don't change; they're still large:

library(lme4)
lme4_1 <- lme(RT ~ fact1*fact2*fact3, random = ~1 | sub, data=myData)

> anova(lme4_1)
Analysis of Variance Table
                     Df    Sum Sq   Mean Sq Denom F value    Pr(>F)
fact1I                1 2.709e-07 2.709e-07    48 21.9205 2.360e-05 ***
fact2I                1 9.229e-08 9.229e-08    48  7.4665  0.008772 **
fact3L                1 4.906e-08 4.906e-08    48  3.9691  0.052047 .
fact3M                1 4.326e-07 4.326e-07    48 34.9972 3.370e-07 ***
fact1I:fact2I         1 1.095e-07 1.095e-07    48  8.8619  0.004552 **
fact1I:fact3L         1 8.988e-10 8.988e-10    48  0.0727  0.788577
fact1I:fact3M         1 1.957e-08 1.957e-08    48  1.5834  0.214351
fact2I:fact3L         1 3.741e-09 3.741e-09    48  0.3027  0.584749
fact2I:fact3M         1 3.207e-08 3.207e-08    48  2.5949  0.113767
fact1I:fact2I:fact3L  1 2.785e-09 2.785e-09
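A minimal sketch of the conservative per-subject check described above. It assumes myData has columns RT, sub and fact1, and treats fact1 as two-level; that coding is an assumption made purely for illustration, not taken from the thread.

## One fact1 contrast per subject, from the subject-by-fact1 cell means
cellMeans <- with(myData, tapply(RT, list(sub, fact1), mean))
d <- cellMeans[, 2] - cellMeans[, 1]
## Test the mean contrast against an SE with 4 df (5 subjects - 1)
t.test(d)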
Fwd: [R] Enduring LME confusion or Psychologists and Mixed-Effects
In my understanding of the problem, the model

lme1 <- lme(resp ~ fact1*fact2, random = ~1 | subj)

should be OK, providing that variances are homogeneous both between and within subjects. The function will sort out which factors and interactions are to be compared within subjects, and which between subjects. The problem with df's arises (for lme() in nlme, but not in lme4) when random effects are crossed, I believe. It is difficult to give a general rule on the effect of imbalance; it depends on the relative contributions of the two variances and the nature of the imbalance.

There should be a rule that people who ask these sorts of questions are required to make their data available, either (if the data set is small) as part of their message or (if data are extensive) on a web site. Once the results of the analysis are on display, it is often possible to make an informed guess on the likely impact. Use of simulate.lme() seems like a good idea (a sketch follows below).

John Maindonald.

On 11 Aug 2004, at 8:05 PM, [EMAIL PROTECTED] wrote:

From: Spencer Graves [EMAIL PROTECTED] Date: 10 August 2004 8:44:20 PM To: Gijs Plomp [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [R] Enduring LME confusion or Psychologists and Mixed-Effects

Have you considered trying a Monte Carlo? The significance probabilities for unbalanced anovas use approximations. Package nlme provides simulate.lme to facilitate this. I believe this function is also mentioned in Pinheiro and Bates (2000). hope this helps. spencer graves

p.s. You could try the same thing in both library(nlme) and library(lme4). Package lme4 is newer and, at least for most cases, better.

Gijs Plomp wrote:

Dear ExpeRts, Suppose I have a typical psychological experiment that is a within-subjects design with multiple crossed variables and a continuous response variable. Subjects are considered a random effect. So I could model

aov1 <- aov(resp ~ fact1*fact2 + Error(subj/(fact1*fact2)))

However, this only holds for orthogonal designs with equal numbers of observations and no missing values. These assumptions are easily violated, so I seek refuge in fitting a mixed-effects model with the nlme library.

lme1 <- lme(resp ~ fact1*fact2, random = ~1 | subj)

When testing the significance of the effects of my factors, with anova(lme1), the degrees of freedom that lme uses in the denominator span all observations and are identical for all factors and their interaction. I read in a previous post on the list ([R] Help with lme basics) that this is inherent to lme. I studied the instructive book of Pinheiro & Bates and I understand why the degrees of freedom are assigned as they are, but think it may not be appropriate in this case. Used in this way it seems that lme is more prone to type 1 errors than aov. To get more conservative degrees of freedom one could model

lme2 <- lme(resp ~ fact1*fact2, random = ~1 | subj/fact1/fact2)

But this is not a correct model, because it assumes the factors to be hierarchically ordered, which they are not. Another alternative is to model the random effect using a matrix, as seen in [R] lme and mixed effects on this list.

lme3 <- lme(resp ~ fact1*fact2, random = list(subj = pdIdent(form = ~fact1-1), subj = ~1, fact2 = ~1))

This provides correct degrees of freedom for fact1, but not for the other effects, and I must confess that I don't understand this use of matrices; I'm not a statistician. My questions thus come down to this:

1. When aov's assumptions are violated, can lme provide the right model for within-subjects designs where multiple fixed effects are NOT hierarchically ordered?
2. Are the degrees of freedom in anova(lme1) the right ones to report? If so, how do I convince a reviewer that, despite the large number of degrees of freedom, lme does provide a conservative evaluation of the effects? If not, how does one get the right denDF in a way that can be easily understood?

I hope that my confusion is all due to an ignorance of statistics and that someone on this list will kindly point that out to me. I do realize that this type of question has been asked before, but think that an illuminating answer can help R spread into the psychological community.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
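For reference, a hedged sketch of the simulate.lme() idea raised above, following the pattern of the example on its help page (check ?simulate.lme for the current interface); the Orthodont data from nlme stand in for the poster's data, purely for illustration.

library(nlme)
## Null model: random intercepts; the alternative (m2) adds random slopes.
## simulate.lme() simulates data under the null, refits both models, and
## lets you compare the empirical likelihood-ratio statistic with the
## chi-squared approximation.
orthSim <- simulate.lme(
    list(fixed = distance ~ age, data = Orthodont, random = ~ 1 | Subject),
    m2 = list(random = ~ age | Subject), nsim = 200, seed = 1)
plot(orthSim)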
Re: [R] anti-R vitriol
I am curious. What were the dimensions of this data set? Did this person know to use read.table(), or scan()? Did they know about the possibility of reading the data one part at a time?

The way that SAS processes the data row by row limits what can be done. It is often possible, with scant loss of information, and more satisfactory, to work with a subset of the large data set or with multiple subsets. Neither SAS (in my somewhat dated experience of it) nor R is entirely satisfactory for this purpose. But at least in R, given a subset that fits so easily into memory that the graphs are not masses of black, there are few logistical problems in doing, rapidly and interactively, a variety of manipulations and plots, with each new task taking advantage of the learning that has gone before. To do that well in the SAS world, it is necessary to use something like JMP or its equivalent in one of the newer modules, which process data in a way that is not all that different from R.

I have wondered about possibilities for a suite of functions that would make it easy to process through R data that are stored in one large data set, with a mix of adding a new variable or variables, repeating a calculation on successive subsets of the data, producing predictions or suchlike for separate subsets, etc.; a rough sketch follows at the end of this message. Database connections may be the way to go (cf. the Ripley and Fei Chen paper at ISI 2003), but it might also be useful to have a simple set of functions that would handle some standard requirements.

John Maindonald.

On 30 Jun 2004, at 8:02 PM, Barry Rowlingson [EMAIL PROTECTED] wrote:

A colleague is receiving some data from another person. That person reads the data in SAS and it takes 30s and uses 64k RAM. That person then tries to read the data in R and it takes 10 minutes and uses a gigabyte of RAM. Person then goes on to say: It's not that I think SAS is such great software, it's not. But I really hate badly designed software. R is designed by committee. Worse, it's designed by a committee of statisticians. They tend to confuse numerical analysis with computer science and don't have any idea about software development at all. The result is R. I do hope [your colleague] won't have to waste time doing [this analysis] in an outdated and poorly designed piece of software like R. Would any of the committee like to respond to this? Or shall we just slap our collective forehead and wonder how someone could get such a view?

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
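As a rough sketch of the "process a large data set in pieces" idea above; the file name and chunk size are invented for illustration.

## Read a large CSV in chunks, processing each chunk as it arrives
con <- file("big.csv", open = "r")                # hypothetical file
nms <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line
repeat {
    lines <- readLines(con, n = 10000)            # next block of rows
    if (length(lines) == 0) break
    tc <- textConnection(lines)
    chunk <- read.csv(tc, header = FALSE, col.names = nms)
    close(tc)
    ## ... per-chunk summaries, predictions, added variables, etc. ...
}
close(con)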
Re: [R] terminology for frames and environments
From: Peter Dalgaard [EMAIL PROTECTED] Date: 14 June 2004 5:42:29 PM

Gabor Grothendieck [EMAIL PROTECTED] writes: ... Could someone please clarify what standard terminology is?

Not sure there is one... We have been approaching consensus on a couple of occasions, but (obviously) not been too good at enforcing it. I think the consensus is that a frame is a set of variable bindings (implemented as a hashed list), and an environment is a frame plus an enclosing environment, i.e. a linked list of frames, terminated by NULL. It is occasionally necessary to refer to the individual frames as opposed to the whole list, which is exactly the point of the inherits argument. Notice that exists() talks about enclosing, which is only ever used in sense #1 above. parent is used in both senses (which is a bit unfortunate -- not quite sure whether we have decided to get rid of parent.env() eventually).

-- Peter Dalgaard, Dept. of Biostatistics, University of Copenhagen, Blegdamsvej 3, 2200 Cph. N, Denmark. Ph: (+45) 35327918 ([EMAIL PROTECTED]) FAX: (+45) 35327907

I have found it helpful, in trying to explain (to myself and others) what happens, to say that there is both a lexical stack and a call stack. Is that a legitimate use of terminology?

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
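A small worked example of the frame/environment distinction described above (the names e1 and e2 are mine):

e1 <- new.env()                  # a frame plus an enclosing environment
assign("x", 1, envir = e1)
e2 <- new.env(parent = e1)       # e1 encloses e2
exists("x", envir = e2)                     # TRUE: searches the chain of frames
exists("x", envir = e2, inherits = FALSE)   # FALSE: this frame only
identical(parent.env(e2), e1)               # TRUE: the enclosing environment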
Re: [R] p-values
Frank - Thanks for your reply, on which I really have no comment. There are contexts where a Bayesian approach is necessary, natural and easy to handle, and can be used to broaden the inferential vision of students. Examples from HIV testing and mammography screening in Gigerenzer's book, sold in the USA under the title Calculated Risks: How to Know When Numbers Deceive You (Simon & Schuster), can be used to make highly important points. I have, in the section on inference in my book with John Braun, injected some of this into the discussion of inference.

[Incidentally, Frank's book does have a great deal of practically oriented comment (e.g., wrt variable selection) that I have found useful. There is useful critical comment on inappropriate use of p-values in such contexts.]

John Maindonald.

On 3 May 2004, at 9:58 AM, Frank E Harrell Jr wrote:

I'm sorry to have taken so long in responding to your excellent question John. And I'm responding to r-help since the question was posed there before taking the discussion offline. In the words of Don Berry, "It takes time to be a Bayesian", and that's the main reason there are no explicit Bayesian calculations in the book. I do make a lot of use of prior information though. In the future I could see making some additions to the book along the lines of inclusion of examples using the MCMCpack package, whose design is appealing to me.

R provides a lot of help for those who want a frequentist interpretation, even to including by default the *, **, *** labeling that some of us deplore. There is no similar help for those who want at least the opportunity to place the output from a modeling exercise in a Bayesian context of some description. There is surely a strong argument for the use of a more neutral form of default output, even to the excluding of p-values, on the argument that they also push too strongly in the direction of a frequentist interpretative framework.

Agreed. I do try to approximate the Bayesian approach with the bootstrap.

There seems, unfortunately, to be a dearth of good ideas on how to assist the placing of output from modeling functions such as R provides in an explicitly Bayesian framework. . . . .

It's all worth pursuing. I wish there were already a Bayesian package that made use of Bayesian methods irresistible. All the best, Frank

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
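A minimal sketch of the kind of MCMCpack usage Frank alludes to above, using a built-in data set; this is my illustration, not an example from the thread. MCMCregress() fits a linear model by sampling from the posterior and returns an mcmc object of draws.

library(MCMCpack)
fit <- MCMCregress(weight ~ height, data = women)   # posterior draws
summary(fit)   # posterior means and quantile-based credible intervals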
[R] p-values
The Bayesian framework is surely a good framework for thinking about inference, and for exploring common misinterpretations of p-values. P-values are surely unhelpful and to be avoided in cases where there is `strong' prior evidence. I will couch the discussion that follows in terms of confidence intervals, which makes the discussion simpler, rather than in terms of p-values.

The prior evidence is in my sense strong if it leads to a Bayesian credible interval that is very substantially different from the frequentist confidence interval (though I prefer the term `coverage interval'). Typically the intervals will be similar if a diffuse prior is used, i.e., all values over a wide enough range are, on some suitable scale, a priori equally likely (a numerical sketch follows at the end of this message). This is, in my view, the message that you should take from your reading.

Examples of non-diffuse priors are what Berger focuses on. Consider for example his discussion of one of Jeffreys' analyses, where Jeffreys puts 50% of the probability on a point value of a continuous parameter, i.e., there is a large spike in the prior at that point. Berger commonly has scant commentary on the specific features of his priors that make the Bayesian results seem very different (at least to the extent of having a different feel) from the frequentist results. His paper in vol. 18, no. 1 of Statistical Science (pp. 1-32; pp. 12-27 are comments from others) seems more judicious in this respect than some of his earlier papers.

It is interesting to speculate how R's model fitting routines might be tuned to allow a Bayesian interpretation. What family or families of priors would be on offer, and/or used by default? What default mechanisms would be suitable for indicating the sensitivity of results to the choice of prior?

John Maindonald.

From: Greg Tarpinian [EMAIL PROTECTED] Date: 28 April 2004 6:32:06 AM To: [EMAIL PROTECTED] Subject: [R] p-values

I apologize if this question is not completely appropriate for this list. I have been using SAS for a while and am now in the process of learning some C and R as a part of my graduate studies. All of the statistical packages I have used generally yield p-values as a default output to standard procedures. This week I have been reading Testing Precise Hypotheses by J.O. Berger & Mohan Delampady, Statistical Science, Vol. 2, No. 3, 317-355, and Bayesian Analysis: A Look at Today and Thoughts of Tomorrow by J.O. Berger, JASA, Vol. 95, No. 452, pp. 1269-1276, both as supplements to my Math Stat. course. It appears, based on these articles, that p-values are more or less useless. If this is indeed the case, then why is a p-value typically given as a default output? For example, I know that PROC MIXED and lme( ) both yield p-values for fixed effects terms. The theory I am learning does not seem to match what is commonly available in the software, and I am just wondering why. Thanks, Greg

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
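The promised numerical illustration (my example, not from the thread) of the point about diffuse priors: for a normal mean with known error SD, the credible interval under a vague normal prior all but coincides with the frequentist interval.

set.seed(42)
y <- rnorm(20, mean = 5, sd = 2); n <- length(y); sigma <- 2
## Frequentist 95% interval (sigma taken as known, for simplicity)
mean(y) + c(-1, 1) * qnorm(0.975) * sigma/sqrt(n)
## Conjugate posterior under a N(0, tau^2) prior; diffuse as tau grows
tau <- 100
post.var  <- 1/(n/sigma^2 + 1/tau^2)
post.mean <- post.var * sum(y)/sigma^2
post.mean + c(-1, 1) * qnorm(0.975) * sqrt(post.var)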
Re: [R] p-values
This is, of course, not strictly about R. But if there should be a decision to pursue such matters on this list, then we'd need another list to which such discussion might be diverted.

I've pulled Frank's Regression Modeling Strategies down from my shelf and looked to see what he says about inferential issues. There is a suggestion, in the introduction, that modeling provides the groundwork that can be used as a point of departure for a variety of inferential interpretations. As far as I can see, Bayesian interpretations are never really explicitly discussed, though the word Bayesian does appear in a couple of places in the text. Frank, do you now have ideas on how you would (perhaps, in a future edition, will) push the discussion in a more overtly Bayesian direction? What might be the style of a modeling book, aimed at practical data analysts who of necessity must (mostly, at least) use off-the-shelf software, that seriously entertains the Bayesian approach?

R provides a lot of help for those who want a frequentist interpretation, even to including by default the *, **, *** labeling that some of us deplore. There is no similar help for those who want at least the opportunity to place the output from a modeling exercise in a Bayesian context of some description. There is surely a strong argument for the use of a more neutral form of default output, even to the excluding of p-values, on the argument that they also push too strongly in the direction of a frequentist interpretative framework.

There seems, unfortunately, to be a dearth of good ideas on how to assist the placing of output from modeling functions such as R provides in an explicitly Bayesian framework. Or is it, at least in part, that I am unaware of what is out there? That, I guess, is the point of my question to Frank. Is it just too technically demanding to go much beyond trying to get users to understand that a Bayesian credible interval can, if there is an informative prior, be very different from a frequentist CI, and that they really do need to pause if there is an informative prior lurking somewhere in the undergrowth?

John Maindonald.

Frank Harrell wrote:

They [p-values] are objective only in the sense that subjectivity is deferred in a difficult-to-document way when P-values are translated into decisions. The statement that frequentist methods are the norm, which I'm afraid is usually true, is a sad comment on the state of much of scientific inquiry. IMHO P-values are so defective that the imperfect Bayesian approach should be seriously entertained.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Solutions to Exercises - Data Analysis and Graphics Using R
This message is aimed at anyone who may be using exercises from my book with John Braun for teaching purposes. I am using this channel of communication in the absence of any other obvious effective channel. I ask the forbearance of list members.

Our intention is to post solutions to selected exercises (the more challenging exercises) on the web, via a link from http://wwwmaths.anu.edu.au/~johnm/r-book.html

Anyone who would find this inconvenient should contact me directly. I will shortly post a provisional list on the above web site.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] simple test of lme, questions on DF corrections
Dear Greg - My understanding is that method=ML changes only the criterion used to fit the model; the parameter estimates are then the ML parameter estimates, and the fitted values are the fitted values for ML. The easiest way to get the anova sums of squares is to specify:

fm1.aov <- aov(travel ~ 1 + Error(factor(Rail)), data=Rail)
summary(fm1.aov)

Error: factor(Rail)
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  5 9310.5  1862.1

Error: Within
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 12 194.000  16.167

This compares with:

fm1.lme <- lme(travel ~ 1, random = ~1 | Rail, data=Rail, method="ML")
sum(fm1.lme$residuals[,2]^2)
[1] 195.0106
sqrt(195.0106/12)
[1] 4.031238

Consistent with this, the predicted values that are obtained with predict(fm1.lme, level=1) (the Best Linear Unbiased Predictors, or BLUPs) are not the group means that you can get from:

predict(aov(travel ~ Rail, data=Rail))

Thus a residual mean square of 194/12 has become, in the ML fit, 195.0106/12. This change in the predicted values (BLUPs), in the residuals, and in the sum of squares of residuals arises because the likelihood is now being maximised for a model where there are two variance parameters that must be estimated. Notice that the BLUPs are pulled back towards the overall mean, relative to the group means. NB also: specify level=1 to incorporate the random group (Rail) effects into the predicted values.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

-----

Date: Sat, 29 Mar 2003 20:19:33 -0500 From: Greg Hammett [EMAIL PROTECTED] Subject: [R] simple test of lme, questions on DF corrections To: [EMAIL PROTECTED] Message-ID: [EMAIL PROTECTED] Content-Type: text/plain; charset=US-ASCII

I'm a physicist working on fusion energy and dabble in statistics only occasionally, so please excuse gaps in my statistical knowledge. I'd appreciate any help that a real statistics expert could provide. Most people in my field do only very simple statistics, and I am trying to extend some work on multivariate linear regression to account for significant between-group correlated errors. It looks to me like the lme routine in the nlme package does most of what I want. (My congrats to the developers of R and nlme; they are extremely useful!)

To summarize my questions: I've tested lme on a very simple test case (with just 1 parameter, the mean, and 1 random effect), but the results are somewhat different than I get from some simple maximum-likelihood formulas. It appears that some quantities calculated by lme are corrected for the reduction in the degrees of freedom due to the number of fit parameters. But these corrections are slightly different (0.3%-3%) than what I would have expected, and I'd like to understand these differences better. . ---

__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
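To see the shrinkage described above directly, a small sketch (the Rail data come with nlme):

library(nlme)
fm1.lme <- lme(travel ~ 1, random = ~1 | Rail, data = Rail, method = "ML")
## BLUPs incorporate the random Rail effects and are pulled towards the
## overall mean, relative to the within-group means:
blup  <- unique(predict(fm1.lme, level = 1))
gmean <- unique(predict(aov(travel ~ Rail, data = Rail)))
cbind(BLUP = blup, groupMean = gmean, overall = mean(Rail$travel))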
[R] (no subject)
Let's assume that the columns of the model matrix, apart perhaps from an initial column that corresponds to the overall mean, have been centred. Then:

1) Normal equation methods give an accurate fit to the matrix of centred sums of squares and products.
2) QR methods give an accurate fit to the predicted values.

QR will give better precision than normal equation methods (e.g., Cholesky) if there are substantial correlations between the columns of the model matrix. This is because sequential normal equation methods successively modify the centred sums of squares and products (CSSP) matrix to be a representation of the matrix of sums of squares and products of partial residuals, as columns of the model matrix are partialed out in turn. QR directly modifies a representation of the partial residuals themselves. If columns of the model matrix are almost uncorrelated, then normal equation methods may however give the better precision, essentially because the CSSP matrix does not change much and normal equation methods require fewer arithmetic operations.

In the situations where QR gives substantially better precision, the correlations between columns of the model matrix will mean that some regression coefficients have a large standard error. The variance inflation factor for some regression coefficients will be large. Will the additional precision be meaningful? The question has especial point now that double precision is standardly used.

I think it useful to expose students to both classes of method. In contexts where QR gives results that are numerically more precise, I'd encourage them to examine the variance inflation factors (they should examine them anyway). It is often a good idea, if the VIFs are large, to consider whether there is a simple re-parameterization [perhaps as simple as replacing x1 and x2 by (x1+x2) and (x1-x2)] where correlations create less havoc. This may be an important issue for interpretation, even if the increased numerical accuracy serves no useful purpose. (A small numerical sketch of the precision point follows at the end of this message.)

--

Date: Mon, 24 Feb 2003 13:50:31 -0500 From: Chong Gu [EMAIL PROTECTED]

Not only is it unfair criticism, it's probably also imprecise information. For a detailed discussion of the precision of regression estimates through QR decomposition and normal equations, one may consult Golub and Van Loan's book on Matrix Computation (1989, Section 5.3.9 on page 230). QR takes twice as much computation, requires more memory, but does NOT necessarily provide better precision. The above said, I am not questioning the adequacy of the QR approach to regression calculation as implemented in R.

That's an unfair criticism. That discussion was never intended as a recommendation for how to compute a regression. Of course, SVD or QR decompositions are the preferred method, but many newbies don't want to digest all that right from the start. These are just obscure details to the beginner. One of the strengths of R in teaching is that students can directly implement the formulae from the theory. This reinforces the connection between theory and practice. Implementing the normal equations directly is a quick early illustration of this connection. Explaining the precise details of how to fit a regression model is something that can be deferred. Julian Faraway

I am just about working through Faraway's excellent tutorial practical regression and ANOVA using R

I assume this is a reference to the PDF version available via CRAN. I am afraid that is *not* a good discussion of how to do regression, especially not using R.
That page is seriously misleading: good ways to compute regressions are QR decompositions with pivoting (which R uses) or an SVD. Solving the normal equations is well known to square the condition number, and is close to the worst possible way. (If you must use normal equations, do at least centre the columns, and preferably do some scaling.)

on page 24 he makes the x matrix:

x <- cbind(1, gala[,-c(1,2)])

how can I understand this gala[,-c(1,2)]? I couldn't find an explanation of such c-like abbreviations anywhere.

Well, it is in all good books (as they say), including `An Introduction to R'. (It's even on page 210 of that book!) -c(1,2) is (try it)

> -c(1,2)
[1] -1 -2

so this drops columns 1 and 2. It then adds in front a column made up of ones, which is usually a sign of someone not really understanding how R's linear models work.

__ [EMAIL PROTECTED] mailing list http://www.stat.math.ethz.ch/mailman/listinfo/r-help

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list http
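The promised numerical sketch of the condition-number point (my example, invented for illustration): with nearly collinear columns, forming X'X squares the condition number, so the normal equations lose roughly twice the digits that QR does.

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + 1e-6 * rnorm(n)      # almost collinear with x1
y <- x1 + x2 + rnorm(n)
X <- cbind(1, x1, x2)
b.ne <- solve(crossprod(X), crossprod(X, y))   # normal equations
b.qr <- qr.coef(qr(X), y)                      # QR, as lm.fit() uses
cbind(normal = b.ne, QR = b.qr)                # compare the coefficients
kappa(X); kappa(crossprod(X))                  # condition number is squared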
Re: [R] acceptable p-level for scientific studies
Another interesting paper is: Gigerenzer, G. 1998. We need statistical thinking, not statistical rituals. Behavioural and Brain Sciences 21: 199-200.

It is interesting that Gigerenzer's recent, highly readable and disturbing book, Calculated Risks (US: Simon & Schuster) or Reckoning with Risk (UK: Allen Lane), has no reference to p-values in the index. Gigerenzer describes an investigation where just one AIDS counsellor out of 20 showed any recognition that the estimate of the probability of infection, given a positive AIDS test, would depend on the risk group from which the person came. The very small P[positive test | no infection] is an incomplete part of the story. Regrettably, I suspect that the result would be much the same for AIDS or genetics counsellors anywhere.

John Maindonald email: [EMAIL PROTECTED] phone: +61 2 (6125)3473 fax: +61 2 (6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.

__ [EMAIL PROTECTED] mailing list http://www.stat.math.ethz.ch/mailman/listinfo/r-help
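The point about risk groups can be made concrete with Bayes' rule; the rates below are invented for illustration and are not Gigerenzer's figures.

## P(infection | positive test) as a function of the base rate
ppv <- function(prev, sens, spec) {
    tp <- prev * sens                # true positives
    fp <- (1 - prev) * (1 - spec)    # false positives
    tp / (tp + fp)
}
ppv(prev = 0.0001, sens = 0.999, spec = 0.9999)  # low-risk group: about 0.5
ppv(prev = 0.01,   sens = 0.999, spec = 0.9999)  # high-risk group: about 0.99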