Re: [R] Survey package

2007-09-10 Thread Thomas Lumley
On Thu, 6 Sep 2007, eugen pircalabelu wrote:

 Good afternoon!

 I'm trying to use the survey package for a stratified sample which has
 4 criteria on which the stratification is based. I would like to get the
 corrected weights, but for every element I get a weight of 1.

  E.g., typing:

   design <- svydesign(id=~1, strata=~regiune + size_loc + age_rec_hhh
  + size_hh, data=tabel)
   and then weights(design)
  gives me: 1,1,1,1,1,1,1,1,1,1,1,... for each element


There are two problems.  The first is that you have the wrong syntax for 
strata.  If you have one stage of sampling with multiple stratifying 
factors you need to create a single factor representing the strata. One 
way is with interaction()

design <- svydesign(id=~1, strata=~interaction(regiune, size_loc,
age_rec_hhh, size_hh), data=tabel)

Second, you have not specified either weights or population sizes, so R 
has no way to work out the sampling weights. That's why you get weights of 
1.  You should also get a warning.
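
For example (a minimal sketch: it assumes the sampling weights are in a
column 'pw', or alternatively that the population stratum sizes are in a
column 'Nh' -- both column names hypothetical):

design <- svydesign(id=~1,
                    strata=~interaction(regiune, size_loc, age_rec_hhh, size_hh),
                    weights=~pw, data=tabel)   # supply sampling weights, or
design <- svydesign(id=~1,
                    strata=~interaction(regiune, size_loc, age_rec_hhh, size_hh),
                    fpc=~Nh, data=tabel)       # supply population stratum sizes
weights(design)   # no longer all 1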

-thomas



Re: [R] the survey package

2007-09-10 Thread Thomas Lumley
On Thu, 6 Sep 2007, Tobias Verbeke wrote:

 eugen pircalabelu wrote:

   I'm trying to use the survey package to get a better point of view
  for my data, but I need some advice:

   I have some data from a survey which has been stratified using 2
  criteria: region (7 values) and size of locality (5 values). Using the
  survey package, how can I define this design correctly (taking
  into account all 4 strata, not just one as in the survey example)?

snip
 According to ?svydesign, strata is a formula.

 The following should work (untested):

 design <- svydesign(ids=~0, strata=~regiune + size_loc, data=tabel)

This would be a two-stage sample, you actually need ~interaction(regiune, 
size_loc).

[this reply is just to make sure it ends up linked in the archives].

-thomas



Re: [R] Are the error messages of ConstrOptim() consistent with each other?

2007-09-10 Thread Thomas Lumley
 [quoted definition of the expression 'apple2' omitted: several thousand
 characters of deeply nested arithmetic in ss, tot, sh, lbar, lambda,
 sigmae, mat and rint, garbled in the archive]

 apple.ana <- function(rec1, lambda1, lbar1) {
     rec <- rec1
     lambda <- lambda1
     lbar <- lbar1
     apple <- eval(apple2)
     gradient <- cbind(eval(D(apple2, 'rec')), eval(D(apple2, 'lambda')),
                       eval(D(apple2, 'lbar')))
     attr(apple.ana, 'gradient') <- gradient
     apple
 }

 fit.error <- function(rec1, lambda1, lbar1) {
     rec <- rec1
     lambda <- lambda1
     lbar <- lbar1
     sum((eval(apple2)*1000 - orange)^2/(orange^2))
 }

 fit.error.grr <- function(rec1, lambda1, lbar1) {
     rec <- rec1
     lambda <- lambda1
     lbar <- lbar1

     drec <- sum(2*eval(D(apple2,'rec'))*(eval(apple2)*1 - orange)/(orange^2))
     dlambda <- sum(2*eval(D(apple2,'lambda'))*(eval(apple2)*1 - orange)/(orange^2))
     dlbar <- sum(2*eval(D(apple2,'lbar'))*(eval(apple2)*1 - orange)/(orange^2))

     c(drec, dlambda, dlbar)
 }

 ui=matrix(c(1,-1,0,0,0,0,0,0,1,-1,0,0,0,0,0,0,1,-1),6,3)
 ci=c(0,-0.5,0,-2,0,-0.6)

 constrOptim(c(0.5,0.3,0.5), f=fit.error, gr=fit.error.grr, ui=ui,ci=ci)
 constrOptim(c(0.5,0.9,0.5), f=fit.error, gr=fit.error.grr, ui=ui,ci=ci)
 constrOptim(c(0.3,0.5,0.5), f=fit.error, gr=fit.error.grr, ui=ui,ci=ci)

 ###



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Survey package

2007-09-10 Thread Thomas Lumley
On Sun, 9 Sep 2007, eugen pircalabelu wrote:
 A short example:

 stratum id weight nh Nh  y sex
 1  1  3 5 15 23   1
 1  2  3 5 15 25   1
 1  3  3 5 15 27   2
 1  4  3 5 15 21   2
 1  5  3 5 15 22   1
 2  6  4 3 12 33   1
 2  7  4 3 12 27   1
 2  8  4 3 12 29   2

 where nh is the stratum sample size, Nh the corresponding population
 size, and y is a metric variable.

 Now if I let

 design <- svydesign(id=~1, data=age, strata=~stratum, fpc=~Nh)
 then weights(design) gives me 3,3,3,3,3,4,4,4.

 If I then let

 x <- postStratify(design, strata=~sex, data.frame(sex=c(1,2),
 freq=c(10,15)))
 the weights become

      1      2      3      4      5      6      7      8
   2.17   2.17   5.35   5.35   2.17   1.73   1.73   4.28

 If I define

 design <- svydesign(id=~1, data=age)
 x <- postStratify(design, strata=~sex, data.frame(sex=c(1,2),
 freq=c(10,15)))
 the weights become 2 2 5 5 2 2 2 5

 The question: does postStratify recognize that I have already stratified
 in the first design by stratum and then post-stratifies by sex? And
 why is that? (Because I don't have the full joint distribution, the
 sex*stratum crossing, needed to apply the postStratify
 function correctly.) I see that Mr Lumley uses the postStratify function when the
 design does not include strata (eg from ?postStratify:


This gives you a design stratified by stratum and post-stratified by sex, 
which is not the same as stratifying by stratum*sex or post-stratifying by 
stratum*sex.

In this case you should probably rake() on stratum and sex rather than 
just post-stratifying. Post-stratifying on sex is equivalent to one iteration 
of the iterative proportional fitting algorithm used in raking.
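
For example, with the toy data above (a minimal sketch; the stratum margin
comes from the Nh column, the sex margin from the postStratify() call, and
rake() expects the count column to be named 'Freq'):

library(survey)
design <- svydesign(id=~1, data=age, strata=~stratum, fpc=~Nh)
pop.stratum <- data.frame(stratum=c(1,2), Freq=c(15,12))   # the Nh values
pop.sex     <- data.frame(sex=c(1,2),     Freq=c(10,15))
raked <- rake(design, sample.margins=list(~stratum, ~sex),
              population.margins=list(pop.stratum, pop.sex))
weights(raked)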

-thomas



Re: [R] mode or parameters of readBin

2007-09-10 Thread Thomas Lumley
On Mon, 10 Sep 2007, Sigbert Klinke wrote:

 Hi,

  sapply(formals(readBin), mode)
       con      what         n      size    signed    endian
    "name"    "name" "numeric" "logical" "logical"    "call"

 returns "logical" for the mode of size. But the documentation says
 that size should be an integer. Does anyone know why the mode is logical?


Because NA is a logical constant.
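
A quick check (sketch):

mode(NA)           # "logical" -- the bare NA is the logical missing value
mode(NA_integer_)  # "numeric" -- an explicitly typed NA reports differently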

-thomas



Re: [R] Excel (off-topic, sort of)

2007-08-30 Thread Thomas Lumley
On Wed, 29 Aug 2007, Alberto Monteiro wrote:

 Do you know what's in my wish list?

 I wish spreadsheets and computer languages had gone one
 step further.

 I mean, it's nice to define Cell X to be equal to
 Cell Y + 10, and then when we change Cell Y, magically we
 see Cell X change.

 But why can't it be the reverse? Why can't I change Cell X
 _and see the change in Cell Y_?


Well, most statistical calculations are data-reducing, ie, many-to-one.

I don't think you are likely to find a language where you can change the 
confidence limits for the sample mean and have the data change in 
response.


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] subset using noncontiguous variables by name (not index)

2007-08-27 Thread Thomas Lumley
On Mon, 27 Aug 2007, Muenchen, Robert A (Bob) wrote:

 Gabor, That works great!

 I think this would be a very helpful addition to the main R
 distribution. Perhaps with a single colon representing numerical order
 (exactly as you have written it) and two colons representing the order
 of the variables as they appear in the data frame (your first example).
 That's analogous to SAS' x1-xN, which you know gets those N variables,
 and a--z, which selects an unknown number of variables a through z. How
 many that is depends upon their order in the data frame. That would not
 only be very useful in general, but it would also make transitioning to
 R from SAS or SPSS less confusing.

 Is R still being extended in such basic ways, or does that muck up
 existing programs too much?


In principle base R can be extended like that, but a strong case is needed 
for non-standard evaluation rules and for depleting the restricted supply 
of short binary operator names.

The reason for subset() and its behaviour is that 'variables as they 
appear in the data frame' is typically ambiguous -- which data frame?  In 
SPSS you have only one and in SAS there is a default one, so there is no 
ambiguity in X1--Y2, but in R it needs another argument specifying the 
data frame, so it can't really be a binary operator.
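
That extra argument is what subset() itself provides; a small sketch of
selecting a range of variables by their order in a given data frame:

df <- data.frame(a=1, x1=2, x2=3, y2=4, z=5)
subset(df, select = x1:y2)   # columns x1 through y2, as ordered in df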

The double colon :: and triple colon ::: are already used for namespaces, 
and a search of r-help reveals two previous, different, suggestions for 
%:%.


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Max vs summary inconsistency

2007-08-27 Thread Thomas Lumley
On Mon, 27 Aug 2007, Adam D. I. Kramer wrote:

 Hello,

 I'm having the following questionable behavior:

 summary(m)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1   13000   26280   25890   38550   50910
 max(m)
 [1] 50912

 typeof(m)
 [1] "integer"
 class(m)
 [1] "integer"

 ...it seems to me like max() and summary(m)[6] ought to return the same
 number. Am I doing something wrong?


They do return the same number; it is just printed differently. summary() 
prints four significant digits by default.
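
One way to see this (a sketch with made-up data matching the output above):

m <- c(1L, 13000L, 26280L, 38550L, 50912L)
max(m)                  # 50912
summary(m)              # Max. displayed as 50910 (four significant digits)
summary(m, digits = 7)  # Max. displayed as 50912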

  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] oddity with method definition

2007-08-27 Thread Thomas Lumley
On Mon, 27 Aug 2007, Faheem Mitha wrote:


 Just wondered about this curious behaviour. I'm trying to learn about
 classes. Basically setMethod works the first time, but does not seem to
 work the second time.
 Faheem.
 *
 setClass("foo", representation(x="numeric"))

 bar <- function(object)
   {
     return(0)
   }

 bar.foo <- function(object)
   {
     print(object@x)
   }
 setMethod("bar", "foo", bar.foo)

 bar(f)

 # bar(f) gives 1.

Not for me. It gives
 bar(f)
Error: object "f" not found
Error in bar(f) : error in evaluating the argument 'object' in selecting a
method for function 'bar'

However, if I do
f = new("foo", x = 1)
first, it gives 1.

 bar <- function(object)
   {
     return(0)
   }

Here you have masked the generic bar() with a new function bar(). Redefining 
bar() is the problem, not the second setMethod().

 bar.foo <- function(object)
   {
     print(object@x)
   }
 setMethod("bar", "foo", bar.foo)

Because there was a generic bar(), even though it is overwritten by the new 
bar(), setMethod() doesn't automatically create another generic.

 f = new("foo", x = 1)

 bar(f)

 # bar(f) gives 0, not 1.


Because bar() isn't a generic function:
 bar
function(object)
   {
 return(0)
   }


If you had used setGeneric() before setMethod(), as recommended, your example 
would have done what you expected, but it would still have wiped out any 
previous methods for bar() -- eg, try
  setMethod("bar", "baz", function(object) print("baz"))
before you redefine bar(), and notice that getMethod("bar", "baz") no longer 
finds it.
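
A sketch of the recommended order, setGeneric() before setMethod():

setClass("foo", representation(x = "numeric"))
setGeneric("bar", function(object) standardGeneric("bar"))
setMethod("bar", "foo", function(object) print(object@x))
f <- new("foo", x = 1)
bar(f)   # prints 1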



-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Saving results from Linux command line

2007-08-24 Thread Thomas Lumley

There could still be functions that divert a copy of all the output to a 
file, for example.  And indeed there are.

sink("transcript.txt", split=TRUE)
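
In full (a sketch; the filename is arbitrary):

sink("transcript.txt", split=TRUE)   # divert a copy of all output to the file
## ... work as usual; output appears on screen and in the file ...
sink()                               # stop diverting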

-thomas

On Fri, 24 Aug 2007, Muenchen, Robert A (Bob) wrote:

 I looked long and hard for that information. Thank you VERY much! -Bob

 -Original Message-
 From: Richard M. Heiberger [mailto:[EMAIL PROTECTED]
 Sent: Friday, August 24, 2007 1:52 PM
 To: Muenchen, Robert A (Bob); r-help@stat.math.ethz.ch
 Subject: Re: [R] Saving results from Linux command line

 There can't be functions in the R language to save the transcript
 of a session.  In this respect R is a filter.  It takes an input
 stream of text and returns an output stream of text.  R doesn't
 remember the streams.  The Windows RGui remembers them.  The ESS *R*
 buffer remembers them.  Any terminal emulator could in principle
 remember them.  R itself can't.



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] R old versions archive anywhere?

2007-08-23 Thread Thomas Lumley
On Thu, 23 Aug 2007, [EMAIL PROTECTED] wrote:

 Hi Folks,

 Does anyone know if (apparently not on CRAN) there is
 any archive of older versions of R packages?

Yes. On CRAN. At the bottom of the page listing all the packages there is 
a section
-
Related Directories

Archive
 Previous versions of the packages listed above.
-

linking to
[CRAN]/src/contrib/Archive


-thomas



Re: [R] All subsets regression

2007-08-22 Thread Thomas Lumley
On Wed, 22 Aug 2007, Alan Harrison wrote:

 Hey folks,

 I'm trying to do all-subsets selection on a zero-inflated Poisson regression.
 I'm aware of the leaps and regsubsets functions but I don't know if they
 work for ZIP regressions or how the syntax fits in for them.

They don't.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Regulatory Compliance and Validation Issues

2007-08-19 Thread Thomas Lumley
On Fri, 17 Aug 2007, Cody Hamilton wrote:

snip
 I have a few specific comments/questions that I would like to present to 
 the R help list.
snip
 2. While the document's scope is limited to base R plus recommended 
 packages, I believe most companies will need access to functionalities 
 provided by packages not included in the base or recommended packages. 
 (For example, I don't think I could survive without the sas.get() 
 function from the Design library.)  How can a company address the issues 
 covered in the document for packages outside its scope?  For example, 
 what if a package's author does not maintain historical archive versions 
 of the package?  What if the author no longer maintains the package? 
 Is the solution to add more packages to the recommended list (I'm fairly 
 certain that this would not be a simple process) or is there another 
 solution?

This will have to be taken up with the package maintainer.  The R 
Foundation doesn't have any definitive knowledge about, eg, Frank 
Harrell's development practices and I don't think the FDA would regard our 
opinions as relevant.

Archiving, at least, is addressed by CRAN: all the previously released 
versions of packages are available.

 3. At least at my company, each new version must undergo basically the 
 same IQ/OQ/PQ as the first installation.  As new versions of R seem to 
 come at least once a year, the ongoing validation effort would be 
 painful if the most up-to-date version of R is to be maintained within 
 the company.  Is there any danger in delaying the updates (say updating 
 R within the company every two years or so)?

It's worse than that: there are typically 4 releases of R per year (the 
document you are commenting on actually gives dates).  The ongoing 
validation effort may indeed be painful, and this was mentioned as an 
issue in the talk by David James & Tony Rossini.

The question of what is missed by delaying updates can be answered by 
looking at the NEWS file. The question of whether it is dangerous is 
really an internal risk management issue for you.

-thomas



Re: [R] Question about lme and AR(1)

2007-08-19 Thread Thomas Lumley
On Sat, 18 Aug 2007, Francisco Redelico wrote:

 Dear R users,


 As far as I know, the EM algorithm can only be applied to estimate
 parameters from a regular exponential family.

No. The EM algorithm will converge to a stationary point (and except in 
artificial cases, to a local maximum) for any likelihood.

The special case you may be thinking of is that in some problems the 
E-step is equivalent to computing E[missing data | observed data] rather 
than the more general E[loglikelihood|observed data]
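
In symbols (a standard statement of the E-step, added for clarity):

   Q(theta | theta') = E[ log L(theta; complete data) | observed data, theta' ]

For a regular exponential family the complete-data log-likelihood is linear
in the sufficient statistics, so this reduces to computing the expected
sufficient statistics given the observed data.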

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] using sampling weights in glm

2007-08-15 Thread Thomas Lumley
On Wed, 15 Aug 2007, Werner Wernersen wrote:

 Hi,

 I have a supposedly representative sample from a
 stratified survey. Hence, each observation has an
 associated sampling weight which inflates the sample
 to population size.

 I want to run a glm probit regression on the data but
 I am not clear about if or better how I can use the
 weights in it. From the description of the weights
 argument of glm it seems to me that I cannot plug
 these weights in there.


You want svyglm() in the survey package.
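
A minimal sketch for a probit model (data frame, column and weight names
hypothetical):

library(survey)
des <- svydesign(id=~1, strata=~stratum, weights=~pw, data=mydata)
fit <- svyglm(y ~ x1 + x2, design=des, family=quasibinomial(link="probit"))
summary(fit)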

  -thomas


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Regsubsets statistics

2007-08-09 Thread Thomas Lumley
On Wed, 8 Aug 2007, Markus Brugger wrote:


 Dear R-help,

 I have used the regsubsets function from the leaps package to do subset
 selection of a logistic regression model with 6 independent variables and
 all possible two-way interactions. As I want to get information about the
 statistics behind the selection output, I've intensively searched the
 mailing list to find answers to the following questions:

 1. What should I do to get the statistics behind the selection (e.g. BIC)?
 summary.regsubsets(object) just returns "*" meaning in or " " meaning out.
 As the plot function generates BICs, obviously these values must
 be computed and available somewhere, but where? Is it possible to directly
 get AIC values instead of BIC?

These statistics are in the object returned by summary(). Using the first 
example from the help page
 names(summary(a))
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
 summary(a)$bic
[1] -19.60287 -28.61139 -35.65643 -37.23388 -34.55301


 2. As to the plot function, I've encountered a problem with setting the ylim
 argument. I fear that this (nice!) particular plot function ignores many of
 these additional arguments. How can I nevertheless change this setting?

You can't (without modifying the plot function). The ... argument is required 
for inheritance [ie, required for R CMD check] but it doesn't take graphical 
parameters.

 3. As it is not explicitly mentioned in the manual: can I really use
 regsubsets for logistic regression?


No.  If your data set is large enough relative to the number of variables, you 
can fit a model with all variables and then apply regsubsets() to the weighted 
linear model arising from the IWLS algorithm.  This will give an approximate 
ranking of models that you can then refit exactly. This is useful if you wanted 
to summarize the best few thousand models on 30 variables but not if you want a 
single model. On the other hand, regsubsets() isn't useful if you want a single 
model anyway.
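
A hedged sketch of that IWLS approximation (the data frame 'dat' with a
binary response y is hypothetical):

fit <- glm(y ~ ., data=dat, family=binomial)
z <- fit$linear.predictors + residuals(fit, type="working")  # working response
w <- fit$weights                                             # IWLS weights
library(leaps)
dat2 <- data.frame(dat[setdiff(names(dat), "y")], z=z)
a <- regsubsets(z ~ ., data=dat2, weights=w, nvmax=10)
summary(a)$bic   # approximate ranking only; refit the best models exactly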


 -thomas



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Mixture of Normals with Large Data

2007-08-08 Thread Thomas Lumley
On Wed, 8 Aug 2007, Martin Maechler wrote:

 BertG == Bert Gunter [EMAIL PROTECTED]
 on Tue, 7 Aug 2007 16:18:18 -0700 writes:

  TV Have you considered the situation of wanting to
  TV characterize probability densities of prevalence
  TV estimates based on a complex random sample of some
  TV large population.

BertG No -- and I stand by my statement. The empirical
BertG distribution of the data themselves are the best
BertG characterization of the density. You and others are
BertG free to disagree.

 I do agree with you Bert.
 From a practical point of view however, you'd still want to use an
 approximation to the data ECDF, since the full ecdf is just too
 large an object to handle conveniently.

 One simple, quite small and probably sufficient such
 approximation may be
 using the equivalent of quantile(x, probs = (0:1000)/1000),
 which is closely related to just working with a binned version of
 the original data; something others have proposed as well.


I have done Normal (actually logNormal) mixture fitting to pretty large data 
(particle counts by size) for summary purposes.  In that case it would not have 
done just as well to use quantiles as I had many sets of data (every three 
hours for several months) and the locations of the mixture components drift 
around  over time.  The location, scale, and mass of the four mixture 
components really were the best summaries. This was the application that 
constrOptim() was written for.

  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] small sample techniques

2007-08-08 Thread Thomas Lumley
On Wed, 8 Aug 2007, Nair, Murlidharan T wrote:

 Indeed, I understand what you say. The degrees of freedom for the dummy example are
 n1+n2-2 = 8. But when I run the t.test I get 5.08; am I missing
 something?


Yes. You are probably looking for the version of the t.test that assumes equal 
variances (the original one), so you need var.equal=TRUE.
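
That is, continuing the dummy example:

t.test(drgA.p, drgB.p, var.equal=TRUE)   # pooled-variance t test, df = n1+n2-2 = 8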

  -thomas


 -Original Message-
 From: Moshe Olshansky [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, August 07, 2007 9:05 PM
 To: Nair, Murlidharan T; r-help@stat.math.ethz.ch
 Subject: Re: [R] small sample techniques

 Hi Nair,

 If the two populations are normal the t-test gives you
 the exact result for whatever the sample size is (the
 sample size will affect the number of degrees of
 freedom).
 When the populations are not normal and the sample
 size is large it is still OK to use t-test (because of
 the Central Limit Theorem) but this is not necessarily
 true for the small sample size.
 You could use simulation to find the relevant
 probabilities.

 --- Nair, Murlidharan T [EMAIL PROTECTED] wrote:

 If my sample size is small is there a particular
 switch option that I need to use with t.test so that
 it calculates the t ratio correctly?

 Here is a dummy example:

 alpha = 0.05

 Mean pain reduction for A = 27, B = 31, and the SDs are
 SDA = 9, SDB = 12

 drgA.p <- rnorm(5, 27, 9)

 drgB.p <- rnorm(5, 31, 12)

 t.test(drgA.p, drgB.p) # what do I need to give as
 additional parameter here?



 I can do it manually but was looking for a switch
 option that I need to specify for  t.test.



 Thanks ../Murli









Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Robust Standard Errors for lme object

2007-08-07 Thread Thomas Lumley
On Tue, 7 Aug 2007, Doran, Harold wrote:

 Lucy:

 Why are you interested in robust standard errors from lme? Typically,
 robust standard errors are sought when there is model misspecification
 due to ignoring some covariance among units within a group.

 But, a mixed model is designed to directly account for covariances among
 units within a group such that the standard errors more adequately
 represent the true sampling variance of the parameters.


I think it's a perfectly reasonable thing to want, but it is not easy to 
provide because of the generality of lme().  For example, models with crossed 
effects need special handling, and possibly so do time-series or spatial 
covariance models with large numbers of observations per group.

I imagine that misspecification of the variance, rather than the correlation, 
would be the main concern, just as it is with independent observations. Of 
course the model-robust variances would only be useful if the sample size of 
top-level units was large enough, and if the variance components were not of 
any direct interest.


 So, the lme standard errors are robust in the sense that they are
 presumably correct if you have your model correctly specified.

To paraphrase the Hitchhiker's Guide: This must be some definition of the word 
'robust' that I was not previously aware of. :)


  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



[R] [R-pkgs] surveyNG (and survey)

2007-08-04 Thread Thomas Lumley

'surveyNG' version 0.3 is on CRAN.

 This package provides experimental features for survey analysis that may be 
incorporated in the survey package in the future. Currently there are 
facilities for analysis of complex surveys using (possibly large) data sets 
stored in a SQLite database. However, analysis facilities for these SQL-backed 
survey designs are rather more limited than in the 'survey' package.

Version 0.3 adds hexagonal binning plots and kernel smoothing.


Also, the 'survey' package hasn't been announced on this list since version 2.9 
in 2005, and version 3.6-11 was recently posted.  It provides fairly 
comprehensive facilities for analysis of complex survey designs.  Major 
additions since 2.9 are calibration estimators (aka GREG or generalized 
raking), simple two-phase designs, and smoothing.


 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

___
R-packages mailing list
[EMAIL PROTECTED]
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] constrOptim

2007-08-02 Thread Thomas Lumley
On Wed, 1 Aug 2007, Joanne Lee wrote:

 The constraints on the optimization problem are:
 1 - components of potentialargmin must add to 1.
 2 - each potentialargmin component must be (weakly) positive and
 (weakly) less than 1.
 3 - potentialargmin %*% c(1,2,3,4,5,6,7,8,9) = 4.5


constrOptim() is not good at equality constraints, so constraints 1 and 3 
would be better expressed by reparametrization.  There is no need to 
constrain the parameters to be positive as the function already goes 
infinite at zero.  Finally, if the parameters are non-negative there is no 
need to constrain them to be <= 1, as this is implied by the first 
constraint.
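
For instance, constraint 1 can be removed by a softmax reparametrization so
that an unconstrained optimizer applies (a sketch only: 'objective' is a
hypothetical function of the nine components, and constraint 3 would still
need its own reparametrization):

softmax <- function(theta) { e <- exp(c(0, theta)); e/sum(e) }
fn <- function(theta) objective(softmax(theta))   # 8 free parameters
res <- optim(rep(0, 8), fn)
p.hat <- softmax(res$par)   # sums to 1, all components in (0,1)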

It might be interesting to find out if some automated Lagrange-multiplier 
approach could be built into constrOptim() for equality constraints, but 
it is not a high enough priority that I am likely to do it.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Nonlinear optimization with constraints

2007-08-02 Thread Thomas Lumley
On Wed, 1 Aug 2007, Patrick Burns wrote:

 If you put in penalties for breaking constraints,
 it is generally better to make the penalty depend
 on the size of the violation.  This keeps the function
 continuous (though usually not differentiable at the
 boundary), and gives the optimizer a hint about
 which way to go.

And this is what constrOptim does (though in a slightly more sophisticated 
way that keeps the function twice-differentiable at the boundary).  I 
don't know why constrOptim() isn't applicable here -- the constraint looks 
like a linear inequality to me.
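
A toy sketch of a linear inequality in constrOptim():

fn <- function(x) sum(x^2)                   # minimize x1^2 + x2^2
res <- constrOptim(theta=c(1, 1), f=fn, grad=NULL,
                   ui=rbind(c(1, 1)), ci=1)  # subject to x1 + x2 >= 1
res$par   # approximately c(0.5, 0.5), on the constraint boundary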

-thomas


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] - round() strange behaviour

2007-08-02 Thread Thomas Lumley
On Thu, 2 Aug 2007, Monica Pisica wrote:


 Hi,

 I am getting some strange results using round - it seems to depend on
 whether the number before the decimal point is odd or even ...

Yes. This is explained in the help page for round().
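
Briefly, R rounds halves to even:

round(0.5)  # 0
round(1.5)  # 2
round(2.5)  # 2 -- round-half-to-even, as documented in ?round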

   -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Q: calling par() in .First()

2007-08-02 Thread Thomas Lumley
On Thu, 2 Aug 2007, D. R. Evans wrote:

 As a result of another thread here, I need to be sure that the
 function par(bg='white') has executed before I create a plot. The
 simplest thing seemed to put it in .Rprofile:

 .First <- function() {
  options(width=150)
  par(bg='white')
 }

 But now R complains at startup:
  Error in .First() : couldn't find function "par"

 I have confirmed that once R has started, I can type par(bg='white')
 and R no longer complains.

 ISwR has a nice little section on Using par, but there's no hint
 that something like this can go wrong :-(

 So what am I not understanding this time?

par() is in the 'graphics' package, which is not loaded by the time .Rprofile 
runs.  You want
graphics::par(bg='white')

Information from which this can be deduced and examples are in ?Startup, though 
it isn't explicitly stated there.
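
So the startup file becomes (restating the fix):

.First <- function() {
    options(width=150)
    graphics::par(bg='white')   # explicit namespace: graphics isn't attached yet
}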

  -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] reading stata files: preserving values of variables converted to factors

2007-07-30 Thread Thomas Lumley
On Thu, 26 Jul 2007, Ben Saylor wrote:

 Hi,

 I am a Stata user new to R.  I am using read.dta to read a Stata file
 that has variables with value labels.  read.dta converts them to
 factors, but seems to recode them with values from 1 to number of
 factor levels (looking at the output of unclass(varname)), so the
 original numerical values are lost.

Yes. The R factor type should not be used if you want the original numeric 
values. It is not a 'labelled numeric' type and the numbers are an 
implementation detail.

  Using convert.factors=FALSE
 preserves the values, but seems to discard the labels.

It doesn't discard the labels. They are kept in the attributes of the data 
frame.
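
A sketch of where to look (attribute names as documented in ?read.dta; the
file name is hypothetical):

library(foreign)
d <- read.dta("mydata.dta", convert.factors=FALSE)
attr(d, "val.labels")    # which label table each variable uses
attr(d, "label.table")   # the value labels themselves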

-thomas



Re: [R] Minitab Parametric Distribution Analysis in R

2007-07-25 Thread Thomas Lumley

The survival package (survreg() function) will fit quite a few parametric 
models under censoring.

If you aren't doing regression, but just one-sample fitting, you can feed the 
appropriate censored or truncated likelihood to mle() in the stats4 package.

Both packages should be part of your R distribution.
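
For the one-sample case, a hedged sketch of a right-censored Weibull
likelihood for mle() (the vectors tt (times) and dd (status, 1=event) are
hypothetical):

library(stats4)
nll <- function(logshape, logscale) {
    shape <- exp(logshape); scale <- exp(logscale)
    -sum(ifelse(dd == 1,
                dweibull(tt, shape, scale, log=TRUE),                       # events
                pweibull(tt, shape, scale, lower.tail=FALSE, log.p=TRUE)))  # censored
}
fit <- mle(nll, start=list(logshape=0, logscale=log(mean(tt))))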

-thomas



On Wed, 25 Jul 2007, Tom La Bone wrote:

 Minitab can perform a Parametric Distribution Analysis - Arbitrary
 Censoring with one of eight distributions (e.g., weibull), giving the
 maximum likelihood estimates of the parameters in the distribution for a
 given dataset. Does R have a package that provides equivalent functionality?
 Thanks for any advice you can offer.



 Tom La Bone





Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] (no subject)

2007-07-23 Thread Thomas Lumley

It's a FAQ (Question 7.32)

-thomas

On Mon, 23 Jul 2007, [EMAIL PROTECTED] wrote:


 Dear Sir/Madam,

 I am running an R program for my SNP data. There are some errors when I run
 the glm model in the Hapassoc software: sometimes it runs out of memory and
 sometimes the matrix is singular. I want to ignore these errors and execute
 the next statement. Otherwise, I am in big trouble. Do you have some
 idea about this problem of ignoring errors?

 Hope to get your help asap.

 thanks.

 -- 
 Wei Guo
 Department of Statistics
 The Ohio State University
 Columbus, Ohio 43210
 Tel: 001-614-292-4713(o)
 e-mail: [EMAIL PROTECTED]



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] GEE code

2007-07-20 Thread Thomas Lumley
On Fri, 20 Jul 2007, Dimitris Rizopoulos wrote:

 from your description I think you need a random-effects model.

Since he specifically said fixed effects I would have assumed from the 
description that he wanted a fixed-effects model.

Since
 you assume normality the parameter estimates from a random-effects
 model (in your case `GDP.log2') have a marginal interpretation (in
 fact they have both conditional and marginal interpretation).

But they still aren't the fixed effects estimates, and they behave 
quite differently under model misspecification (and under confounding).

A fixed effects linear model with GEE just needs the formula
   ineq ~ GDP.log2 + country
to specify an indicator variable for each country.
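
A sketch with the gee package (the data frame and variable names from the
thread are assumed):

library(gee)
fit <- gee(ineq ~ GDP.log2 + factor(country), id=country,
           data=dat, corstr="exchangeable")
summary(fit)$coefficients   # robust SEs in the "Robust S.E." column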

If he had a logistic regression model things would be more complicated, 
but for a linear or log-linear model it is just a matter of adding 
predictors.

Now, I might well use a linear mixed model in this context, but he did 
fairly clearly indicate that wasn't what he was looking for.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] R

2007-07-19 Thread Thomas Lumley
On Thu, 19 Jul 2007, Fluss wrote:
 Hello!
 I am using the svyglm procedure for logistic regression on survey data.
 I wondered how the strata affect the estimates' SE (in addition to the
 weights given proportional to population size).
 I know that for simple regression the measurements in each stratum are
 assumed to have a different variance.
 But in a logistic model this is not the case.

It is simpler (and more complicated) than that.  The survey package uses 
the same formula for nearly all designs and estimators, so it doesn't 
have to handle special cases like estimating separate stratum variances 
for stratified models.

For a population total the variance estimator is just the Horvitz-Thompson 
estimator.  Other estimators are defined by the estimating equations they 
solve, so the mean solves
   sum_i w_i (x_i - mu) = 0
and logistic regression solves
   sum_i w_i x_i (y_i - mu_i) = 0

We now compute the Horvitz-Thompson estimator for the sum of the 
estimating functions (V) and also the population total of the derivative 
of the estimating functions (D). The variance of the estimator is
   D^{-1}VD^{-1}


The standard reference for this in the survey literature seems to be
  Binder, David A. (1983).  On the variances of asymptotically
  normal estimators from complex surveys.  International Statistical
  Review, 51, 279-292.
which is in the References section of help(svyrecvar).

-thomas



Re: [R] Complex surveys, properly computed SEs and non-parametric analyses

2007-07-15 Thread Thomas Lumley
On Sun, 15 Jul 2007, Tobias Verbeke wrote:
 The survey package of Thomas Lumley has very broad functionality for the
 analysis of data from complex sampling designs. Please find below the
 homepage of the package (which is available on CRAN):

 http://faculty.washington.edu/tlumley/survey/

 I don't think non-parametric one-way ANOVA is implemented

No.

 but quoting
 http://faculty.washington.edu/tlumley/survey/survey-wss.pdf
 Many features of the survey package result from requests from
 unsatisfied users.

 For new methods the most important information is a reference
 that gives sufficient detail for implementation. A data set is nice
 but not critical.


Yes, and the details are especially non-obvious here.  The test won't be 
small-sample exact, AFAICS, and it isn't clear whether the idea is to add 
weights to the influence function for the signed-rank test or to replace 
it with a design-based estimate of a population quantity. Often these 
approaches are equivalent, but they won't be in this case. It wouldn't 
have occured to me that people would want this. `Non-parametric' isn't 
really a relevant idea since design-based inference assumes a completely 
known model for the sampling indicators and doesn't even treat the data as 
random variables.

All this goes to say that if there is a standard quantity that John wants, 
it will have resulted in part from a set of arbitrary decisions, and it 
won't be possible to reverse-engineer the estimator in the absence of a 
precise description.  This is in contrast to apparently more complicated 
analyses such as calibration estimators for Cox models in case-cohort
designs, which follow just by putting standard pieces together in an 
obvious way.


-thomas



Re: [R] Help with write.foreign (exporting data to Stata)

2007-07-10 Thread Thomas Lumley
On Tue, 10 Jul 2007, Stefan Grosse wrote:

 I am not sure what you are doing there but what you need is
 library(foreign)
 and
 write.dta()


write.foreign should also work, though.

My guess is that Kate used tempfile() to specify the filenames, and that the 
data file would then have been deleted on leaving R.  This is only a guess, of 
course.

The syntax for write.dta is
   write.dta(the.data.set, file="dataset.dta")
and for write.foreign is
   write.foreign(the.data.set, codefile="dataset.do", datafile="dataset.raw",
                 package="Stata")

 -thomas


 see
 ?write.dta once you have loaded the foreign package

 Stefan

  Original Message  
 Subject: [R] Help with write.foreign (exporting data to Stata)
 From: kdestler [EMAIL PROTECTED]
 To: r-help@stat.math.ethz.ch
 Date: Tue Jul 10 2007 19:37:54 GMT+0200
 Hi.  I'm trying to export a dataframe from R into Stata to use a statistical
 function I have there.  I attached library write.foreign and renamed my
 variables to get them to match Stata's required format, and now have the
 following error:  file /tmp/Rtmps7rmrM/file1c06dac8.raw not found  Other
 than typing write.foreign, do I need to do something in R to get it to save
 the file on my hard drive?  When I search for the file name on my computer
 nothing comes up.  I'm using a Mac in case that makes a difference.

 Thanks,
 Kate




Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] character string to name

2007-07-09 Thread Thomas Lumley
On Mon, 9 Jul 2007, Jim Lemon wrote:

 Hi folks,

 I thought I recalled a request for turning a character string into an
 object name as in:

Yes. It's a FAQ.
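
The usual idiom is indexing with [[ ]] -- a minimal sketch:

x <- data.frame(a=1:4)
nm <- "y"
x[[nm]] <- 1:4   # same effect as x$y <- 1:4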

-thomas


 x$as.name("y") <- 1:4

 OR

 x <- data.frame(as.name("y") = 1:4)

 However, as.name and a few other uninformed attempts didn't even come
 close. A search of character to name produced no helpful functions.
 This isn't a very urgent request, but if anyone knows some trick to
 perform this transformation, I would like to hear about it. Thanks.

 Jim



Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] Is it a bug ?

2007-07-05 Thread Thomas Lumley
On Thu, 5 Jul 2007, Giuseppe PEDRAZZI wrote:
 I am using R 2.5.0, Windows XP - Italian language.

 I was performing some calculations with fractional exponents and
 I found some strange behaviour. I do not know if it is really a bug, but I
 would expect
 a different answer from R.

 I was trying the following:

 x <- seq(-3, 3, by = 0.1)
 n <- 2.2
 y <- exp(-x^n)

 well, the y vector contains NaN for all negative values of x

Yes. Non-integer powers of negative numbers are undefined (unless you use 
complex numbers).

 but if you ask for single value calculation like

 y <- exp(-(-3)^2.2) or

 y <- exp(-(-2.9)^2.2)

 the answer is correct.

I get NaN for both of these.  Perhaps you mean exp(-2.9^2.2)? This gives 
a valid answer, but that is because it is exp(-(2.9^2.2)) not 
exp((-2.9)^2.2)

 It seems it does not do the calculation in vector form.

 I got the same behaviour (NaN)  in a for loop

 for(i in 1:length(x)) y[i]=exp(x[i]^n)
 y
  [1]       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
 [10]       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
 [19]       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN
 [28]       NaN       NaN       NaN      1.00  1.006330  1.029416  1.073302  1.142488  1.243137
 [37]  1.384082  1.578166  1.844237  2.210260  2.718282  3.432491  4.452553  5.936068  8.137120
 [46] 11.473746 16.648415 24.867680 38.251295 60.611092 98.967689 166.572985 289.08 517.425935
 [55] 955.487320 1820.793570 3581.521323 7273.674928 15255.446778 33050.861013 73982.100407


 Is it strange or did I miss something ?

You missed something. It is not clear what you missed because some of your 
examples do not give the answer you say they give.

-thomas



Re: [R] Determining whether a function's return value is assigned

2007-07-02 Thread Thomas Lumley
On Sat, 30 Jun 2007, Paul Laub wrote:

 Dear all,

 Does R offer a means by which a function can determine
 whether its return value is assigned? I am using R
 2.4.1 for Windows.

No.

 Suppose what I am looking for is called
 return.value.assigned. Then one might use it like
 this

myfunction <- function() {
    # Create bigobject here

    if (return.value.assigned()) {
        bigobject
    } else {
        summary(bigobject)
    }
}

That's what the print() function is for. Write a print() method.

-thomas



Re: [R] extract index during execution of sapply

2007-06-22 Thread Thomas Lumley
On Fri, 22 Jun 2007, Ben Bolker wrote:

 Christian Bieli christian.bieli at unibas.ch writes:


 Hi there
 During execution of sapply I want to extract the number of times the
 function given to sapply has been executed. I came up with:

 mylist <- list(a=3, b=6, c=9)
 sapply(mylist, function(x) as.numeric(gsub("[^0-9]", "", deparse(substitute(x)))))

 This works fine, but looks quite ugly. I'm sure that there's a more
 elegant way to do this.

 Any suggestion?

 Christian


   I would love to have an answer to this -- when I run
 into this kind of problem I usually end up using mapply:
 e.g., suppose I have

 mylist <- replicate(5, list(x=runif(10), y=runif(10)), simplify=FALSE)

 and I want to plot each element in a different color.  I'd like
 to be able to do

 plot(0:1, 0:1, type="n")
 lapply(mylist, plot, col=i)

 but instead I do

 mapply(function(x,i) points(x,col=i),mylist,1:5)

 would it be too ugly to have a special variable called INDEX
 that could be used within an sapply/lapply statement?


There are two distinct suggestions here: a variable that says *how many* 
times the function has been called, and a variable that say *which 
element* is currently being operated on.   The first seems undesirable as 
order of evaluation really should not matter in the apply functions.

The second makes more sense but is still a little tricky. AFAICS there is 
no way for lapply() to find out whether FUN will accept an argument INDEX 
without an unused argument(s) error, so it can't just be passed as an 
argument.  This suggests having yet another apply function, that would 
assume an INDEX argument and might be written
   yapply <- function(X, FUN, ...) {
       index <- seq(length.out=length(X))
       mapply(FUN, X, INDEX=index, MoreArgs=list(...))
   }

However, I think it would be preferable in many cases for INDEX to be 
names(X) if it exists, rather than 1:n.  In any case, it is easy  to write 
the function.
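
A usage sketch for yapply() as written above:

mylist <- list(a=3, b=6, c=9)
yapply(mylist, function(x, INDEX) x * INDEX)   # returns 3, 12, 27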

-thomas



Re: [R] model selection criteria in regsubsets

2007-06-21 Thread Thomas Lumley

The calculations are in summary.regsubsets.

Sending three copies of questions like this does not increase the chance 
of a response.

-thomas


On Thu, 21 Jun 2007, [EMAIL PROTECTED] wrote:

 Hi All,

 I used regsubsets in package leaps to do the model subset selection.
 I noticed the bic and cp criteria are both included in this
 function. But it seems they are not calculated by

 bic = -n*log(RSS/n) - (p+1)*log(n)
 and
 cp = (RSS/sigma_hat^2) - (n-2*p-2)

 Could you please let me know what formulas are used for these two criteria?

 Thank you!

 Linda


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle



Re: [R] psm/survreg coefficient values ?

2007-06-19 Thread Thomas Lumley
On Tue, 19 Jun 2007, John Logsdon wrote:

 In survreg() the predictor is log(characteristic life) for Weibull (=
 exponential when scale=1) - ie the 63.2%ile.  For the others the predictor is
 log(median).

 This causes problems when comparing predictions and a better way IMHO is to
 correct the Weibull prediction by a factor (log(2))^(1/scale).  This is only
 a simple multiple unless the shape parameter is also being modelled, when a
 completely different solution may arise.  Such heterogeneity modelling cannot
 of course be done within survreg().

Except, of course, for a discrete predictor of heterogeneity, using 
strata().
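
That is (a hedged sketch; the variable names are hypothetical):

library(survival)
fit <- survreg(Surv(time, status) ~ age + strata(group),
               data=dat, dist="weibull")   # separate scale per level of 'group'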

-thomas


 On Monday 18 June 2007 22:56:54 Frank E Harrell Jr wrote:
 sj wrote:
 I am using psm to model some parametric survival data, the data is for
 length of stay in an emergency department. There are several ways a
 patient's stay in the emergency department can end (discharge, admit,
 etc..) so I am looking at modeling the effects of several covariates on
 the various outcomes. Initially I am trying to fit a  survival model for
 each type of outcome using the psm function in the design package,  i.e.,
 all patients whose visits come to an end due to any event other than
 the event of interest are considered to be censored.  Being new to the
 psm and  survreg packages (and to parametric survival modeling) I am not
 entirely sure how to interpret the coefficient values that psm returns. I
 have included the following code to illustrate code similar to what I am
 using on my data. I suppose that the coefficients are somehow rescaled ,
 but I am not sure how to return them to the original scale and make sense
 out of the coefficients, e.g., estimate the the effect of higher acuity
 on time to event in minutes. Any explanation or direction on how to
 interpret the  coefficient values would be greatly appreciated.

 this is from the documentation for survreg.object:
 coefficients: the coefficients of the linear.predictors, which multiply the
 columns of the model matrix. It does not include the estimate of error
 (sigma). The names of the coefficients are the names of the
 single-degree-of-freedom effects (the columns of the model matrix). If
 the model is over-determined there will be missing values in the
 coefficients corresponding to non-estimable coefficients.

 code:
 LOS <- sort(rweibull(1000, 1.4, 108))
 AGE <- sort(rnorm(1000, 41, 12))
 ACUITY <- sort(rep(1:5, 200))
 EVENT <- sample(x=c(0,1), replace=TRUE, 1000)
 psm(Surv(LOS,EVENT) ~ AGE + as.factor(ACUITY), dist='weibull')

 output:

 psm(formula = Surv(LOS, CENS) ~ AGE + as.factor(ACUITY), dist = "weibull")

  Obs  Events  Model L.R.  d.f.  P    R2
 1000     513     2387.62     5  0  0.91

               Value  Std. Error       z         p
 (Intercept)  1.1055     0.04425   24.98 8.92e-138
 AGE          0.0772     0.00152   50.93  0.00e+00
 ACUITY=2     0.0944     0.01357    6.96  3.39e-12
 ACUITY=3     0.1752     0.02111    8.30  1.03e-16
 ACUITY=4     0.1391     0.02722    5.11  3.18e-07
 ACUITY=5    -0.0544     0.03789   -1.43  1.51e-01
 Log(scale)  -2.7287     0.03780  -72.18  0.00e+00

 Scale = 0.0653

 best,

 Spencer

 I have a case study using psm (survreg wrapper) in my book.  Briefly,
 coefficients are on the log median survival time scale.

 Frank
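
A related note: for a fitted survreg/psm model, predictions on the original
time scale can be requested directly -- a minimal sketch, with `fit'
standing for the fitted model object:

predict(fit, type = "quantile", p = 0.5)   # median survival time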



 -- 
 Best wishes

 John

 John Logsdon   Try to make things as simple
 Quantex Research Ltd, Manchester UK as possible but not simpler
 [EMAIL PROTECTED]  [EMAIL PROTECTED]
 +44(0)161 445 4951/G:+44(0)7717758675   www.quantex-research.com

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression

2007-06-19 Thread Thomas Lumley
On Tue, 19 Jun 2007, spime wrote:

 Is there any Windows version of the Design package?


Not at the moment. It is being updated for changes in R 2.5.0.

[This would be a FAQ except that it should stop being asked soon]

-thomas






 Frank E Harrell Jr wrote:

 spime wrote:

 I haven't found any helpful thread. How can I calculate the BIC and
 Hosmer-Lemeshow statistic for a logistic regression model? I have used
 glm for the logistic fit.

 See the Design package's lrm function and residuals.lrm for a better GOF
 test.
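
For the BIC part no special package is needed -- a minimal sketch on a
built-in dataset (BIC is AIC with penalty log(n)):

fit <- glm(am ~ mpg, family = binomial, data = mtcars)
AIC(fit, k = log(nrow(mtcars)))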



 --
 Frank E Harrell Jr   Professor and Chair   School of Medicine
   Department of Biostatistics   Vanderbilt University

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 -- 
 View this message in context: 
 http://www.nabble.com/BIC-and-Hosmer-Lemeshow-statistic-for-logistic-regression-tf3945943.html#a11195410
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error handling

2007-06-19 Thread Thomas Lumley

This is FAQ 7.32 How can I capture or ignore errors in a long simulation?
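
A minimal sketch of the FAQ's approach, applied to the loop below -- failed
fits are skipped instead of stopping the run:

results <- vector("list", 5)
for (k in 1:5) {
  res <- try(nlm(logfunc, c(1, 1, 1), print.level = 2), silent = TRUE)
  if (!inherits(res, "try-error")) results[[k]] <- res
}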

-thomas

On Tue, 19 Jun 2007, Peter Sajosi wrote:

 Hello,

  I have a question about error handling. I run simulation studies and often 
 the program stops with an error, for example during maximum likelihood. I 
 would like the program not to stop but to continue and I would like to ask 
 how the error handling can be set up for this (if it can). I tried to look 
 through manuals etc but unfortunately did not get closer to the solution. 
 Below is a small example with some numbers, where the nlm function gives an 
 error. Is it possible to make R and the program ignore the error?

  (there is a small for loop in the end of the example, which breaks - ideally 
 the program would complete the for loop even though there are errors). Of 
 course this is just an example, in the simulation study the error comes up 
 quite rarely but still it is annoying to handle it manually each time.
  Many thanks
 Peter

  The example:
  

 logfunc <- function(params) {
   vutil1 <- as.matrix(x2[,1:3]) %*% params
   vutil2 <- as.matrix(x2[,4:6]) %*% params
   logl <- 0
   for (i in 1:6) {
     prob <- log((exp(vutil1[i])*achoices[i,1] + exp(vutil2[i])*achoices[i,2]) /
                 (exp(vutil1[i]) + exp(vutil2[i])))
     logl <- logl + prob
   }
   return(-logl)
 }

 x2 <- array(c(0,4,1,3,5,3,3,2,1,4,1,2,0,2,2,1,1,4,
               1.2233310, 0.000, 0.8155540, 0.9320617, 1.4272195, 1.8349965,
               0.6116655, 3.2622160, 0.8155540, 3.7282469, 0.000, 4.5874913,
               0.6116655, 3.2622160, 1.6311080, 1.8641235, 4.2816586,
               0.9174983), dim=c(6,6))
 achoices <- array(c(1,0,1,0,1,0,0,1,0,1,0,1), dim=c(6,2))

 for (k in 1:5) {
   nlm(logfunc, c(1,1,1), print.level=2)
 }
  ---
  Thanks!!!
  ---





 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] importing .dta files

2007-06-15 Thread Thomas Lumley
On Fri, 15 Jun 2007, Chris Linton wrote:

 I'm trying to read in a Stata file but I've never used this function (
 read.dta).  It's the only one that seems to come close to working, but I
 keep getting this error:

 data <- read.dta("C:/Documents and Settings/Chris/Desktop/S4412/catestscores.dta")
 Error in read.dta("C:/Documents and Settings/Chris/Desktop/S4412/catestscores.dta") :
    a binary read error occurred


 There's little chance the data is corrupt considering it came from my
 professor and he used the data earlier.  So, either I'm doing something
 wrong or R just doesn't like to read in Stata files.  If it's a problem with
 R, how can I easily convert the file without purchasing Stata?


R does read Stata files -- I use this facility frequently.  It's hard to 
tell why it isn't working in your case, since we don't know anything about 
the file, your version of R, version of Stata, etc (we can guess you are 
on windows from the file name).

The error message implies that the file was found, and that it started 
with the right sequence of bytes to be a Stata .dta file, but that 
something (probably the end of the file) prevented R from reading what it 
was expecting to read.

This is why (in the absence of any further information) the natural 
suspicion is that the file is corrupt.  It is possible that we have 
misunderstood some unusual possibility in the Stata file format -- this 
has happened once before -- but it is fairly well documented.  In any 
case, there is not much that can be done without more information.


-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] export to a dat file that SAS can read

2007-06-13 Thread Thomas Lumley
On Wed, 13 Jun 2007, Uwe Ligges wrote:


 Rina Miehs wrote:
 Hello

 i have a data frame in R that some SAS users need to use in their
 programs, and they want it in a dat file, is that possible?

 What is a dat file?


 and which functions to use for that?

 I *guess* write.table() will do the trick, given dat is what I guess
 it is...

Another approach (that preserves factor levels if you have them) is to use 
write.foreign in the 'foreign' package. This writes a text file for the 
data and also writes SAS code to read the file.
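
A minimal sketch -- the data frame and file names here are placeholders:

library(foreign)
write.foreign(mydata, datafile = "mydata.txt", codefile = "mydata.sas",
              package = "SAS")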

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Looking for R-code for non-negative matrix factorization in the presence of Gaussian or Poisson noise

2007-06-11 Thread Thomas Lumley
On Mon, 11 Jun 2007, [EMAIL PROTECTED] wrote:


 Hi all,

 Has any of you implemented code for non-negative matrix factorization to 
 solve

 Y = T P' + E;  dim(Y)=(n,p); dim(T)=(n,nc); dim(P)=(p,nc); dim(E)=(n,p)

 where T and P must be non-negative and E either Gaussian or Poisson noise.

 I'm looking for two variants:

 1. Easy (I think), T is known (that is we just want to solve the general 
 inverse problem)

This is non-negative least squares, a quadratic programming problem, for 
which there is code (at least if n and nc are not too big)
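
One route -- a minimal sketch assuming the nnls package (a Lawson-Hanson
non-negative least squares solver) is available; Tmat and y stand for the
known T matrix and one column of Y:

library(nnls)
fit <- nnls(Tmat, y)   # minimizes ||y - Tmat %*% p|| subject to p >= 0
coef(fit)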


 2. Harder (?), T is unknown (under some restrictions) [as an 
 intermediate, we may want to fix nc]


Even with fixed nc this is Distinctly Non-trivial. It often isn't 
identifiable, for a start.

I've encountered this problem in air pollution source apportionment, where 
people use an algorithm due to Paatero (1999) JCGS 8:854-8, which is a 
conjugate gradient algorithm that handles the constraints by creative 
abuse of preconditioning.  The algorithm seems fairly well-behaved, 
although the statistical properties of the estimates are not 
well-understood [at least, I don't understand them, and I have simulations 
that appear to contradict the views of people who claim to understand 
them].

The difficulty probably depends on the size of the problem -- the air 
pollution problems have n~1000, p~20, nc~7, or larger.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] name of the variable that will contain the result of a function

2007-06-06 Thread Thomas Lumley
On Wed, 6 Jun 2007, Benilton Carvalho wrote:

 Hi everyone,

 say I have a function called 'foo', which takes the argument arg1.

 Is there any mechanism that I can use to learn about the variable
 where foo(arg1) is going to be stored?

No. This information isn't available explicitly even at the C level.

 For example:

 x - foo(arg1)

 so, inside foo() I'd like to be able to get the string "x".

 if,

 foo(arg1)

 was used instead, I'd like to get NA.

It could be much worse than this, for example,
x[[7]][y][[4]] <- foo(arg1)
w <- foo(arg2) + 1
names(x)[foo(arg3)] <- foo(arg4)

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] summing up colum values for unique IDs when multiple ID's exist in data frame

2007-05-29 Thread Thomas Lumley
On Tue, 29 May 2007, Seth Falcon wrote:

 Young Cho [EMAIL PROTECTED] writes:

 I have data.frame's with IDs and multiple columns. B/c some of IDs
 showed up more than once, I need sum up colum values to creat a new
 dataframe with unique ids.

 I hope there are some cheaper ways of doing it...  Because the
 dataframe is huge, it takes almost an hour to do the task.  Thanks
 so much in advance!

 Does this do what you want in a faster way?



rowsum() should probably be faster (but perhaps not much).
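
A minimal sketch, for a hypothetical data frame dat whose first column is
the ID:

totals <- rowsum(dat[, -1], group = dat$id)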

   -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests [Broadcast]

2007-05-28 Thread Thomas Lumley
On Mon, 28 May 2007, Martin Maechler wrote:

 >>>>> "LuckeJF" == Lucke, Joseph F [EMAIL PROTECTED]
 >>>>> on Fri, 25 May 2007 12:29:49 -0500 writes:

    LuckeJF> Most standard tests, such as t-tests and ANOVA,
    LuckeJF> are fairly resistant to non-normality for
    LuckeJF> significance testing. It's the sample means that
    LuckeJF> have to be normal, not the data.  The CLT kicks in
    LuckeJF> fairly quickly.

 Even though such statements appear in too many (text)books,
 that's just plain wrong practically:

 Even though *level* of the t-test is resistant to non-normality,
 the power is not at all!!  And that makes the t-test NON-robust!

While it is true that this makes the t-test non-robust, it doesn't mean 
that the statement is just plain wrong practically.

The issue really is more complicated than a lot of people claim (not you 
specifically, Martin, but upthread and previous threads).

Starting with the demonstrable mathematical facts:
  - lots of rank tests are robust in the sense of Huber
  - rank tests are optimal for specific location-shift testing problems.
  - lots of rank tests have excellent power for location shift alternatives
 over a wide range of underlying distributions.
  - rank tests fail to be transitive when stochastic ordering is not
assumed (they are not consistent with any ordering on all distributions)
  - rank tests do not lead to confidence intervals unless a location shift
or similar one-dimensional family is assumed
  - No rank test is uniformly more powerful than any parametric test or
vice versa (if we rule out pathological cases)
  - there is no rank test that is consistent precisely against a difference
in means
  - the t-test (and essentially all tests) can be made distribution-free in
large samples (for small values of 'large', usually)
  - being distribution-free does not guarantee robustness of power (for the
t-test or for any other test)


Now, if we assume stochastic ordering is the Wilcoxon rank-sum test more 
or less powerful than the t-test?  Everyone knows that this depends on the 
null hypothesis distribution.  Fewer people seem to know that it also 
depends on the alternative, especially in large samples.

Suppose the alternative of interest is not that the values are uniformly 
larger by 1 unit, but that 5% of them are about 20 units larger.  The 
Wilcoxon test -- precisely because it gives less weight to outliers -- 
will have lower power.  For example (ObR)

one.sim <- function(n, pct, delta){
 x <- rnorm(n)
 y <- rnorm(n) + delta*rbinom(n, 1, pct)
 list(x=x, y=y)
}

mean(replicate(100, {d <- one.sim(100, .05, 20); t.test(d$x, d$y)$p.value}) < 0.05)
mean(replicate(100, {d <- one.sim(100, .05, 20); wilcox.test(d$x, d$y)$p.value}) < 0.05)

mean(replicate(100, {d <- one.sim(100, .5, 1); t.test(d$x, d$y)$p.value}) < 0.05)
mean(replicate(100, {d <- one.sim(100, .5, 1); wilcox.test(d$x, d$y)$p.value}) < 0.05)


Since both relatively uniform shifts and large shifts of small fractions 
are genuinely important alternatives in real problems it is true in 
practice as well as in theory that neither the Wilcoxon nor the t-test is 
uniformly superior.

This is without even considering violations of stochastic ordering -- 
which are not just esoteric pathologies, since it is quite plausible for a 
treatment to benefit some people and harm others. For example, I've seen 
one paper in which a Wilcoxon test on medical cost data was statistically 
significant in the *opposite direction* to the difference in means.

This has been a long rant, but I keep encountering statisticians who think 
anyone who ever recommends a t-test just needs to have the number 0.955 
quoted to them.

snip

    LuckeJF> Testing for normality prior to choosing a test
    LuckeJF> statistic is generally not a good idea.

 Definitely. Or even: It's a very bad idea ...


I think that's something we can all agree on.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] convergence of coxfilter and coxph

2007-05-22 Thread Thomas Lumley
On Mon, 21 May 2007, carol white wrote:

 Hi, coxfilter function in genefilter package uses coxph to fit a model 
 to filter genes. how come that coxfilter could converge to find a 
 solution in cox model fitting using a data matrix of 8000 variables and 
 600 samples but coxph doesn't converge with the same matrix?

coxfilter() fits 8000 one-variable models, which works (for appropriate
values of "works"). coxph() refuses to fit one 8000-variable model.


-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Selecting complementary colours

2007-05-22 Thread Thomas Lumley
On Mon, 21 May 2007, John Fox wrote:

 In retrospect, I didn't specify the problem clearly: What I want to be able
 to do is to place text on a background of arbitrary (but known RGB) colour
 so that the text is legible. I guess that this is better described as a
 contrasting than a complementary colour.

Since luminance contrasts are necessary and sufficient for readable text, 
you could use white for dark colors and black for light colors.

Luminance is roughly proportional to  0.2*(R^2.4)+0.6*(G^2.4), suggesting 
something like

lightdark <- function(color)
{
 rgb <- col2rgb(color)/255
 L <- c(0.2, 0.6, 0) %*% rgb
 ifelse(L <= 0.2, "#FFFF60", "#0000A0")
}
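
Usage would then be, e.g. (the color names here are just illustrative):

lightdark(c("navy", "wheat"))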

This uses a pale yellow for dark backgrounds and a dark blue for light 
backgrounds, and it seems to work reasonably well.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] svychisq

2007-05-22 Thread Thomas Lumley
On Fri, 18 May 2007, Moss, Angela (Dudley PCT) wrote:

 Dear All

 I am trying to use svychisq with a two-dimensional table 4 x 5. The
 command I am using is
 summary(svytable(~dietperception + dietstatus, dudleyls1rake, na.rm=TRUE),
 "Chisq")

 It is throwing up an error message as follows:

 Error in NCOL(y) : only 0's may be mixed with negative subscripts

I can't reproduce this problem at all. I've tried tables with zero cells, 
with and without raking. The na.rm= argument to svytable() can't be 
helping, since svytable() doesn't have an na.rm argument.

Does the same thing happen if you call svychisq() directly rather than via 
summary(svytable())?

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] saving datafreame object problem

2007-05-22 Thread Thomas Lumley
On Tue, 22 May 2007, [EMAIL PROTECTED] wrote:

 Am I missing something here?


Yes.


 dtaa =
 read.table("http://www.ats.ucla.edu/stat/mplus/examples/ma_snijders/mlbook1.dat",
 sep=",")

 head(dtaa)   # shows the data as it should be

 save(dtaa, dtaa, file="c:/dtaa")

 d = load("c:/dtaa")


From ?load
Value:

  A character vector of the names of objects created, invisibly.

So d is correct. Try ls() to find the loaded data.
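
That is, something like:

d <- load("c:/dtaa")
d           # the character string "dtaa", not the data
head(dtaa)  # the data frame itself, restored under its original name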

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] lapply not reading arguments from the correct environment

2007-05-18 Thread Thomas Lumley
On Fri, 18 May 2007, jiho wrote:

 Hello,

 I am facing a problem with lapply which I think may be a bug.
 This is the most basic function in which I can reproduce it:

 myfun <- function()
 {
   foo = data.frame(1:10, 10:1)
   foos = list(foo)
   fooCollumn = 2
   cFoo = lapply(foos, subset, select=fooCollumn)
   return(cFoo)
 }

snip
 I get this error:
   Error in eval(expr, envir, enclos) : object "fooCollumn" not found
 while fooCollumn is defined, in the function, right before lapply.
snip
 This is with R 2.5.0 on both OS X and Linux (Fedora Core 6)
 What did I do wrong? Is this indeed a bug? An intended behavior?

No, it isn't a bug (though it may be confusing).

The problem is that subset() evaluates its select argument in an unusual 
way. Usually the argument would be evaluated inside myfun() and the value 
passed to lapply(), and everything would work as you expect.

subset() bypasses the normal evaluation and explicitly evaluates the 
select argument in the calling frame, ie, inside lapply(), where 
fooCollumn is not visible.

You could do
   lapply(foos, function(foo) subset(foo, select=fooCollumn))
capturing fooCollumn by lexical scope.  In R this is often a better option
than passing extra arguments to lapply (or other functions that take 
function arguments).

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Scoped options setting?

2007-05-17 Thread Thomas Lumley
On Wed, 16 May 2007, Zack Weinberg wrote:

 Is there any way to set options during the evaluation of a particular
 expression, with them automatically reset when control leaves that
 expression, however that happens?  Kind of like let on a special
 variable does in Lisp.  I naively tried


You could write with_options() as

with_options <- function(optionlist, expr){
  oldoptions <- options(optionlist)
  on.exit(options(oldoptions))
  eval(substitute(expr), parent.frame())
}

and then do
   with_options(list(warn=-1),  whatever)

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [OT] Is data copyrightable?

2007-05-14 Thread Thomas Lumley

This is an area where US law differs importantly from other countries. US 
law protects compilations of facts only to the extent that the selection 
of the facts is creative expression (and does not protect the facts 
themselves).  Many other jurisdictions (eg European Union) also offer 
protection based on the effort needed to compile the facts regardless of any
creativity.  A 1991 US Supreme Court decision (in a case about telephone
directories) ruled that the 'sweat of the brow' rationale for copyright 
was inconsistent with the intellectual property clause of the US 
Constitution.  So, in the US, it depends on the data and their source.

Publishers that I have talked to tend to claim that data are definitely 
copyrightable, but since they tend to own the copyrights one might do well 
to recall the immortal words of Mandy Rice-Davies.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

On Sat, 12 May 2007, hadley wickham wrote:

 Dear all,

 This is a little bit off-topic, but I was wondering if anyone has any
 informed opinion on whether data (ie. a dataset) is copyrightable?

 Hadley

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] PRESS criterion in leaps

2007-05-14 Thread Thomas Lumley
On Fri, 11 May 2007, Andrew Smith wrote:
 I thought it would be simplest to build on already existing functions like
 regsubsets in package leaps.  It's easy enough to calculate the PRESS
 criterion for a fitted lm object, but I'm having trouble deciphering the
 structure of the regsubsets objects that leaps works with.  Is there any way
 to calculate press from a regsubsets?  Or, to put it another way, can I get
 the residual vector and the diagonal entries of the hat matrix from a
 regsubsets object?  In fact, if the hat matrix is never calculated
 explicitly, the columns of Q from the QR factorization would suffice.


Not only is the hat matrix never calculated explicitly, the Q matrix isn't 
calculated either.  The code forms R and Q^TY directly (the same code is 
used in the biglm package to provide bounded-space linear regression).
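
For a single fitted lm (though not for a regsubsets object) PRESS is a
one-liner -- a minimal sketch:

press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)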

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

2007-05-10 Thread Thomas Lumley
On Tue, 8 May 2007, Frank E Harrell Jr wrote:

 Does anyone know of an R function for computing the Greenland-Robins
 variance for Mantel-Haenszel relative risks?

Both the (almost identical) meta-analysis packages compute the MH 
estimator for relative risks and its standard error.
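
E.g., with the rmeta package -- a minimal sketch, with hypothetical column
names for the 2x2 counts:

library(rmeta)
m <- meta.MH(ntrt, nctrl, ptrt, pctrl, data = mydata, statistic = "RR")
summary(m)   # MH relative risk with its standard error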

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] power 2x3 exact test

2007-05-10 Thread Thomas Lumley
On Thu, 10 May 2007, [EMAIL PROTECTED] wrote:

 Given that you expect some cells to be small, it should not
 be a severe task to draw up a list of (a1,b1) values which
 correspond to rejection of the null hypothesis (that both
 ORs equal 1), and then the simulation using different values
 of the two odds-ratios will give you the power for each such
 pair of odds-ratios.

 The main technical difficulty will be simulation of random
 tables, conditional on the marginals, with the probabilities
 as given above.

 I don't know of a good suggestion for this.

r2dtable().
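
A minimal sketch, with illustrative margins -- two random 2x3 tables with
both margins fixed:

r2dtable(2, c(10, 20), c(8, 12, 10))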

If this is a power calculation, though, you probably want to fix only one 
margin, which is a much simpler problem, and if the table is not too large 
it would not be difficult to compute the exact probability for each 
element of the sample space and so get exact power.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] reference in article

2007-05-02 Thread Thomas Lumley
On Wed, 2 May 2007, Tomas Mikoviny wrote:

 Hi all R positive,

 does anyone know how to refer R  in article?


Every time you start R it says (in part)

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] GLS terminology question not related to R

2007-04-25 Thread Thomas Lumley
On Wed, 25 Apr 2007, Leeds, Mark (IED) wrote:

 This is a terminology question not related to R. The literature often
 says that OLS is inefficient relative to GLS if the residuals in
 the system are correlated (and the RHS of each equation is not identical).
 Does this mean that OLS overestimates residual and coefficient
 variances, underestimates them, or just gets them wrong and the
 direction is not known? Thanks.

It does not mean either.

It means that the true variance of the OLS estimates is greater than the 
true variance of the GLS estimates.

A separate issue is whether the estimated variance of an OLS estimator is 
greater or less than the true variance of the OLS estimator.  This can go 
either way.

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] general question about plotting multiple regression results

2007-04-19 Thread Thomas Lumley
On Thu, 19 Apr 2007, Simon Pickett wrote:

 Hi all,

 I have been bumbling around with R for years now and still haven't come up
 with a solution for plotting reliable graphs of relationships from a
 linear regression.

termplot() does this for a range of regression models (without interaction 
terms). The effects package does it better for linear regression models.
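
A minimal sketch, using the model from the question below -- partial
residuals are usually what one wants on such a plot:

fit <- lm(n.day13 ~ n.day1 + ffemale.yell + fmale.yell + fmale.chroma,
          data = surv)
termplot(fit, partial.resid = TRUE, terms = "ffemale.yell")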

-thomas


 Here is an example illustrating my problem

 1.I do a linear regression as follows

 summary(lm(n.day13~n.day1+ffemale.yell+fmale.yell+fmale.chroma,data=surv))

 which gives some nice sig. results

 Coefficients:
 Estimate Std. Error t value Pr(|t|)
 (Intercept)  -0.73917    0.43742  -1.690 0.093069 .
 n.day1        1.00460    0.05369  18.711  < 2e-16 ***
 ffemale.yell  0.22419    0.06251   3.586 0.000449 ***
 fmale.yell    0.25874    0.06925   3.736 0.000262 ***
 fmale.chroma  0.23525    0.11633   2.022 0.044868 *

 2. I want to plot the effect of ffemale.yell, fmale.yell and
 fmale.chroma on my response variable.

 So, I either plot the raw values (which is fine when there is a very
 strong relationship) but what if I want to plot the effects from the
 model?

 In this case I would usually plot the fitted values against the raw
 values of x... Is this the right approach?

 fit <- fitted(lm(n.day13 ~ n.day1 + ffemale.yell + fmale.yell + fmale.chroma,
                  data=fsurv1))

 plot(fit ~ ffemale.yell)

 #make a dummy variable across the range of x
 x <- seq(from=min(fsurv1$ffemale.yell), to=max(fsurv1$ffemale.yell), length=100)

 #get the coefficients and draw the line
 co <- coef(lm(fit ~ ffemale.yell, data=fsurv1))
 y <- (co[2]*x) + co[1]
 lines(x, y, lwd=2)

 This often does the trick but for some reason, especially when my model
 has many terms in it or when one of the independent variables is only
 significant when the other independent variables are in the equation, it
 gives me strange lines.

 Please can someone show me the light?

 Thanks in advance,

 Simon.






 Simon Pickett
 PhD student
 Centre For Ecology and Conservation
 Tremough Campus
 University of Exeter in Cornwall
 TR109EZ
 Tel 01326371852

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-17 Thread Thomas Lumley
On Tue, 17 Apr 2007, Robert McFadden wrote:


 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon
 Sent: Tuesday, April 17, 2007 12:37 PM
 To: Pedro A Reche
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] help comparing two median with R

 Pedro A Reche wrote:
 Dear R users,
 I am new to R and  I would like to ask your help with the following
 topic. I have three sets of numeral data, 2 sets are paired and a
 third is independent of the other two. For each of these sets I have
 obtained their basic statistics (mean, median, stdv, range ...).
 Now I want to compare if these sets differ. I could compare
 the mean
 doing a basic T test . However, I was looking for a test to compare
 the medians using R.   If that is possible I would love to
 hear the
 specifics.

 Hi Pedro,
 You can use the Mann-Whitney test (wilcox with two
 samples), but you would have to check that the second and
 third moments of the variable distributions were the same, I think.

 Jim
 Use the Mann-Whitney U test, but remember about 2 assumptions:
 1. samples come from a continuous distribution (there are no tied
 observations)
 2. distributions are identical in shape. It's very similar to the t-test but
 the Mann-Whitney U test is not as affected by violation of the homogeneity of
 variance assumption as the t-test is.


This turns out not to be quite correct.

If the two distributions differ only by a location shift then the 
hypothesis that the shift is zero is equivalent to the medians being the 
same (or the means, or the 3.14159th percentile), and the Mann-Whitney U 
test will test this hypothesis. Otherwise the Mann-Whitney U test does not 
test for equal medians.

The assumption that the distributions are continuous is for convenience -- 
it makes the distribution of the test statistic easier to calculate and 
otherwise R uses a approximation.  The assumption of a location shift is 
critical -- otherwise it is easy to construct three data sets x,y,z so 
that the Mann-Whitney U test thinks x is larger than y, y is larger than z 
and z is larger than x (Google for Efron Dice). That is, the Mann-Whitney 
U test cannot be a test for any location statistic.

There actually is an exact test for the median that does not assume a 
location shift:  dichotomize your data at the pooled median to get a 2x2 
table of above/below median by group, and do Fisher's exact test on the 
table.  This is almost never useful (because it doesn't come with an 
interval estimate), but is interesting because it (and the generalizations 
to other quantiles) is the only exactly distribution-free location test 
that does not have the 'non-transitivity' problem of the Mann-Whitney U 
test.  I believe this median test is attributed to Mood, but I have not 
seen the primary source.
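
A minimal sketch of that median test, for two hypothetical samples x and y:

pooled <- median(c(x, y))
tab <- table(group = rep(c("x", "y"), c(length(x), length(y))),
             above = c(x, y) > pooled)
fisher.test(tab)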

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-17 Thread Thomas Lumley
On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:

 The points that Thomas and Brian have made are certainly correct, if one is 
 truly interested in testing for differences in medians or means.  But the 
 Wilcoxon test provides a valid test of x > y more generally.  The test is
 consonant with the Hodges-Lehmann estimator: the median of all possible 
 differences between an X and a Y.


Yes, but there is no ordering of distributions (taken one at a time) that 
agrees with the Wilcoxon two-sample test, only orderings of pairs of 
distributions.

The Wilcoxon test provides a test of x > y if it is known a priori that the
two distributions are stochastically ordered, but not under weaker
assumptions.  Otherwise you can get x > y > z > x. This is in contrast to the
t-test, which orders distributions (by their mean) whether or not they are 
stochastically ordered.

Now, it is not unreasonable to say that the problems are unlikely to occur 
very often and aren't worth worrying too much about. It does imply that it 
cannot possibly be true that there is any summary of a single distribution 
that the Wilcoxon test tests for (and the same is true for other 
two-sample rank tests, eg the logrank test).

I know Frank knows this, because I gave a talk on it at Vanderbilt, but 
most people don't know it. (I thought for a long time that the Wilcoxon 
rank-sum test was a test for the median pairwise mean, which is actually 
the R-estimator corresponding to the *one*-sample Wilcoxon test).


-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] software comparison

2007-04-16 Thread Thomas Lumley
On Mon, 16 Apr 2007, Philippe Grosjean wrote:

 Hello,

 I just read this paper, and was surprised by:

 This study performs a comparison of the latest versions of nine
 different statistical software packages using StRD. These packages are
 SAS (9.1), SPSS (12.0), Excel (2003), Minitab (14.0), Stata (8.1), Splus
 (6.2), R (1.9.1), JMP (5.0), and StatCrunch (3.0).

 For a paper published in 2007, and submitted in April 2005, this is
 still surprising. If my calculation is correct, in 2004, they would
 have used R 2.2.x, or something,... not 1.9.1?

No -- there is a new x.y.0 twice per year. Checking the r-announce 
archives shows that 2.0.0 came out in October 2004 and 1.9.1 in June 2004.

Describing 1.9.1 as the latest version was inaccurate, but it may well 
have been a more recent version than for most of the  other packages they 
examined.

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss (package foreign) and SPSS 15.0 files

2007-04-11 Thread Thomas Lumley
On Thu, 5 Apr 2007, John Kane wrote:
 Heck. I'd be happy to get an answer to what is
 happening here:
 mac <- spss.get("H:/ONTH/Raw.data/Follow.sav")
 Warning message:
 H:/ONTH/Raw.data/Follow.sav: Unrecognized record type
 7, subtype 16 encountered in system file


It means that your file had a record of type 7, subtype 16 in it, and 
read.spss doesn't know how to handle these.

You would have to ask SPSS what record type 7 and subtype 16 represent -- 
their software put them there, and it's their terminology.

People's experience with unrecognised record types is that they usually 
don't matter, which would make sense from a backwards-compatibility point 
of view, but in the absence of documentation or psychic powers it is hard 
to be sure.  Avoiding read.spss is a perfectly reasonable strategy, and is 
in fact what we have always recommended in the Data Import-Export manual.

AFAIK the only commercial statistical software vendor that does provide 
complete, public documentation of their file formats is Stata, and this 
is one reason why there are fewer complaints about read.dta and write.dta. 
It also probably helps that the code was written by someone who uses Stata 
-- there hasn't been much contribution of code or patches for the 
foreign package from SPSS users.


-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using Sampling Weights in R

2007-04-11 Thread Thomas Lumley
On Tue, 10 Apr 2007, Thomas W. Volscho wrote:

 Dear List,
 I have a dataset that provides sampling weights (National Survey of
 Family Growth 2002).  I want to produce a cross-tabulation and use the
 provided sampling weights to obtain representative population estimates.
 (I believe they are simply frequency weights but codebook is
 uninformative).

They are almost certainly not simply frequency weights -- the NCHS web 
page on this survey describes a multistage sampling scheme and gives code 
examples for survey software using the design features.  If you only want 
point estimates and no intervals or p-values then it doesn't matter what 
type of weights they are.

 I can reproduce results (using this data) that were reported in a recent
 journal article, if I use SPSS or STATA--specifying the weight.

The survey package does analysis of surveys of this sort.

Judging from the Stata example at 
http://www.cdc.gov/nchs/data/nsfg/Ser2_Example1_FINAL.pdf
  if the data are in a data frame called 'nsfg' you would create a survey 
design object with

   dnsfg <- svydesign(id=~SECU_R, weights=~FINALWGT, strata=~SEST,
                      data=nsfg)

and then you could get crosstabs with, e.g.,

   svytable(~agerx+pill, design=dnsfg)



-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reasons to Use R

2007-04-11 Thread Thomas Lumley
On Wed, 11 Apr 2007, Alan Zaslavsky wrote:
 I have thought for a long time that a facility for efficient rowwise
 calculations might be a valuable enhancement to S/R.  The storage of the
 object would be handled by a database and there would have to be an
 efficient interface for pulling a row (or small chunk of rows) out of the
 database repeatedly; alternatively the operations could be conducted inside
 the database.  Basic operations of rowwise calculation and cumulation
 (such as forming a column sum or a sum of outer-products) would be
 written in an R-like syntax and translated into an efficient set of
 operations that work through the database.  (Would be happy to share
 some jejune notes on this.)  However the main answer to this problem
 in the R world seems to have been Moore's Law.  Perhaps somebody could
 tell us more about the S-Plus large objects library, or the work that
 Doug Bates is doing on efficient calculations with large datasets.



I have been surprised to find how much you can get done in SQL, only 
transferring summaries of the data into R.  There is soon going to be an 
experimental surveyNG package that works with survey data stored in a SQLite 
database without transferring the whole thing into R for most operations (and I 
could get further if SQLite had the log() and exp() functions that most other 
SQL implementations for large databases provide). I'll be submitting a paper on 
this to useR2007.

The approach of transferring blocks of data into R and using a database just as 
backing store will allow more general computation but will be less efficient 
than performing the computation in the database, so a mixture of both is likely 
to be helpful.  Moore's Law will settle some issues, but there are problems 
where it is working to increase the size of datasets just as fast as it 
increases computational power.
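
A minimal sketch of the transfer-only-summaries idea, with a hypothetical
SQLite file and table:

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "survey.db")
dbGetQuery(con, "SELECT stratum, COUNT(*) AS n, SUM(y) AS total
                 FROM bigdata GROUP BY stratum")
dbDisconnect(con)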


 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] unexpected behavior when creating a list of functions

2007-04-04 Thread Thomas Lumley
On Wed, 4 Apr 2007, Matthew Suderman wrote:

 I wanted to create a list of functions whose output differs depending on
 the value of a variable when the function was created.  Generally this 
 did not work.  Each function was exactly the same, as in the simple 
 example below:

 get_data_function <- function(v) {
   function() {
     print(v)
   }
 }
 data_functions <- sapply(1:10, function(v) get_data_function(v))
 (data_functions[[1]])() # prints 10!

 However, if I insert a statement in get_data_function to print argument 
 v, then I get the different functions that I wanted:

 get_data_function <- function(v) {
   print(v)
   function() {
     print(v)
   }
 }
 data_functions <- sapply(1:10, function(v) get_data_function(v))
 (data_functions[[1]])() # prints 1, as expected!

 I have two questions about this:
 * Is this a bug in R?

No, it's lazy evaluation at work.

 * Is there a more direct way to get the effect of printing v?

The function force() is designed for this -- it doesn't do anything 
special that any other evaluation wouldn't do, but it makes clear that all 
you are doing is forcing evaluation.
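
I.e., a minimal sketch:

get_data_function <- function(v) {
  force(v)   # evaluate v now, not when the inner function is first called
  function() print(v)
}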

-thomas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] strange fisher.test result

2007-04-03 Thread Thomas Lumley
On Mon, 2 Apr 2007, [EMAIL PROTECTED] wrote:

 From the above, the marginal totals for his 2x2 table

  a   b   =   16    8

  c   d       15   24

 are (rows then columns) 24,39,31,32

 These fixed marginals mean that the whole table is determined
 by the value of a. The following function P.FX() computes the
 probabilities of all possible tables, conditional on the
 marginal totals (it is much more transparent than the code
 for the same purpose in fisher.test()):

As this example has shown, 2x2 tables are a nice opportunity for 
illustrating how the ordering of the sample space affects inference 
(because you can actually see the whole sample space).
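
Conditional on the margins quoted above (24, 39, 31, 32), the null
distribution of the (1,1) cell is hypergeometric -- a minimal sketch:

a <- 0:24                   # possible values of the (1,1) cell
p <- dhyper(a, 24, 39, 31)  # P(a | margins) under the null
cbind(a, p)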

I used something like this as a term project in an introductory R class, 
where we wrote code to compute the probabilities for all outcomes 
conditional on one margin, and used this to get (conservative) exact 
versions of all the popular tests in 2x2 tables.  It's interesting to do 
things like compare the rejection regions and power under various 
alternatives for the exact versions of the likelihood ratio test and 
Fisher's test.  We didn't get as far as confidence intervals, but the code 
is at
http://faculty.washington.edu/tlumley/b514/exacttest.R
with .Rd files at
http://faculty.washington.edu/tlumley/b514/man/

[credits: this is all based on ideas from Scott Emerson]


-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tail area of sum of Chi-square variables

2007-03-29 Thread Thomas Lumley
On Thu, 29 Mar 2007, Klaus Nordhausen wrote:

 I was wondering if there are any R functions that give the tail area
 of a sum of chisquare distributions of the type:
 a_1 X_1 + a_2 X_2
 where a_1 and a_2 are constants and X_1 and X_2 are independent chi-square 
 variables with different degrees of freedom.


pchisqsum() in the survey package does this.
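
A minimal sketch, with illustrative constants and degrees of freedom:

library(survey)
# upper tail of 0.7*X_1 + 0.3*X_2 at 5, X_i chi-squared with df 1 and 3:
pchisqsum(5, df = c(1, 3), a = c(0.7, 0.3), lower.tail = FALSE)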

-thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tail area of sum of Chi-square variables

2007-03-29 Thread Thomas Lumley

The Satterthwaite approximation is surprisingly good, especially in the 
most interesting range in the right tail (say 0.9 to 0.999). There is also 
another, better, approximation with a power of a chi-squared distribution 
that has been used in the survey literature.

However, since it is easy to write down the characteristic function and 
perfectly feasible to invert it by numerical integration, we might as well 
use the right answer.

-thomas

On Thu, 29 Mar 2007, S Ellison wrote:
 I was wondering if there are any R functions that give the tail area
 of a sum of chisquare distributions of the type:
 a_1 X_1 + a_2 X_2
 where a_1 and a_2 are constants and X_1 and X_2 are independent
 chi-square variables with different degrees of freedom.

 You might also check out Welch and Satterthwaite's (separate) papers on 
 effective degrees of freedom for compound estimates of variance, which led to 
 a thing called the welch-satterthwaite equation by one (more or less 
 notorious, but widely used) document called the ISO Guide to Expression of 
 Uncertainty in Measurement (ISO, 1995). The original papers are
 B. L. Welch, J. Royal Stat. Soc. Suppl.(1936)  3 29-48
 B. L. Welch, Biometrika, (1938) 29 350-362
 B. L. Welch, Biometrika, (1947) 34 28-35

 F. E. Satterthwaite, Psychometrika (1941) 6 309-316
 F. E. Satterthwaite, Biometrics Bulletin, (1946) 2 part 6 110-114

 The W-S equation - which I believe is a special case of Welch's somewhat more 
 general treatment - says that if you have multiple independent estimated 
 variances v[i] (could be more or less equivalent to your a_i X_i?) with 
 degrees of freedom nu[i], the distribution of their sum is approximately a 
 scaled chi-squared distribution with effective degrees of freedom 
 nu.effective given by

 nu.effective =  sum(v[i])^2 / sum((v[i]^2)/nu[i] )

 If I recall correctly, with an observed variance s^2 (corresponding to the 
 sum(v[i]) above if those are observed variances), nu*(s^2/sigma^2) is
 distributed as chi-squared with degrees of freedom nu, so the scaling factor 
 for quantiles would come out of there (depending whether you're after the 
 tail areas for s^2 given sigma^2 or for a confidence interval for sigma^2 
 given s^2)

 However, I will be most interested to see what a more exact calculation 
 provides!

 Steve Ellison



 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] what is the difference between survival analysis and (...)

2007-03-28 Thread Thomas Lumley
On Wed, 28 Mar 2007, Frank E Harrell Jr wrote:

 Eric Elguero wrote:
 Hi everybody,

 recently I had to teach a course on Cox model, of which I am
 not a specialist, to an audience of medical epidemiologists.
 Not a good idea you might say.. anyway, someone in the
 audience was very hostile. At some point, he said that
 Cox model was useless, since all you have to do is count
 who dies and who survives, divide by the sample sizes
 and compute a relative risk, and if there was significant
 censoring, use cumulated follow-up instead of sample
 sizes and that's it!
 I began arguing that in Cox model you could introduce
 several variables, interactions, etc, then I remembered
 of logistic models ;-)
 The only (and poor) argument I could think of was that
 if Mr Cox took pains to devise his model, there should
 be some reason...

 That is a very ignorant person, concerning statistical
 efficiency/power/precision and how to handle incomplete follow-up
 (variable follow-up duration).  There are papers in the literature (I
 wish I had them at my fingertips) that go into the efficiency loss of
 just counting events.  If the events are very rare, knowing the time
 doesn't help as much, but the Cox model still can handle censoring
 correctly and that person's approach doesn't.


Certainly just counting the events is inefficient -- the simplest example would 
be studies of some advanced cancers where nearly everyone dies during followup, 
so that there is little or no censoring but simple counts are completely 
uninformative.

It's relatively hard to come up with an example where using the 
total-time-on-test (rather than sample size) as a denominator is much worse 
than the Cox model, though. You need the baseline hazard to vary a lot over time 
and the censoring patterns to be quite different in the groups, but 
proportional hazards to still hold.

I think the advantages of the Cox model over a reasonably sensible person-time 
analysis are real, but not dramatic -- it would be hard to find a data set that 
would convince the sort of person who would make that sort of claim.

I would argue that computational convenience on the one hand, and the ability 
to exercise lots of nice mathematical tools on the other hand have also 
contributed to the continuing popularity of the Cox model.


 -thomas

Thomas Lumley   Assoc. Professor, Biostatistics
[EMAIL PROTECTED]   University of Washington, Seattle

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Replacement in an expression - can't use parse()

2007-03-27 Thread Thomas Lumley
On Tue, 27 Mar 2007, Peter Dalgaard wrote:

The way around this is to add a further layer of substitute() to insert
the value of e:

 eval(substitute(substitute(call,list(u2=quote(x),u3=1)),list(call=e[[1]])))
 u1 + x + 1

Or eval(do.call(substitute, list(e[[1]], list(u2=quote(x), u3=1))))

-thomas
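
A self-contained sketch of both idioms (the names u1, u2 and u3 are from the
thread; this example is not from the original messages):

e <- quote(u1 + u2 + u3)
# Peter's idiom: the outer substitute() builds the call
# substitute(u1 + u2 + u3, list(...)), which eval() then runs
eval(substitute(substitute(call, list(u2 = quote(x), u3 = 1)),
                list(call = e)))
## u1 + x + 1
# the do.call() idiom: do.call() itself evaluates the constructed
# substitute() call, so its result is already the modified expression;
# an outer eval() is only needed if you also want to run that expression
do.call(substitute, list(e, list(u2 = quote(x), u3 = 1)))
## u1 + x + 1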



Re: [R] subset

2007-03-26 Thread Thomas Lumley
On Mon, 26 Mar 2007, Marc Schwartz wrote:

 Sergio,

 Please be sure to cc: the list (ie. Reply to All) with follow up
 questions.

 In this case, you would use %in% with a negation:

 NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (!var3 %in% 2:3))


Probably a typo: should be !(var3 %in% 2:3) rather than (!var %in% 2:3)

  -thomas

 See ?"%in%" for more information.

 HTH,

 Marc

 On Mon, 2007-03-26 at 17:30 +0200, Sergio Della Franca wrote:
 Ok, this run correctly.

 Another question for you:

 I want to put more than one condition for var3, i.e.:
 I like to create a subset when:
  - var1=0
  - var2=0
  - var3 is different from 2 and from 3.

 Like you suggested, I performed this code:
 NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (var3 != 2) & (var3 != 3))

 Is there a method to combine (var3 != 2) & (var3 != 3) into one
 condition?

 Thank you.

 Sergio



 2007/3/26, Marc Schwartz [EMAIL PROTECTED]:
 On Mon, 2007-03-26 at 17:02 +0200, Sergio Della Franca wrote:
 Dear R-Helpers,

 I want to make a subset from my data set.

 I'd like to perform different condition for subset.

 I.e.:

 I like to create a subset when:
 - var1=0
 - var2=0
 - var3 is different from 2.

 How can i develop a subset under this condition?

 Thank you in advance.

 Sergio Della Franca.

 See ?subset

 Something along the lines of the following should work:

 NewDF <- subset(DF, (var1 == 0) & (var2 == 0) & (var3 != 2))

 HTH,

 Marc Schwartz
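
A toy check of the two forms (data invented for illustration):

var3 <- c(1, 2, 3, 4)
(var3 != 2) & (var3 != 3)   # two separate conditions
!(var3 %in% 2:3)            # the same test as a single condition
# note: !var3 %in% 2:3 parses the same way as !(var3 %in% 2:3),
# because %in% binds more tightly than ! in R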





Re: [R] Hardware for a new Workstation for best performance using R

2007-03-19 Thread Thomas Lumley
On Thu, 15 Mar 2007, Andrew Perrin wrote: (in part)

 2.) Yes, by all means you should use linux instead of windows. The
 graphics output is completely compatible with whatever applications you
 want to paste them into on Windows.

This turns out not to be the case.

It is not trivial to produce good graphics off Windows for adding to 
Microsoft Office documents (regrettably an important case for many 
people).  There has been much discussion of this on the R-sig-mac mailing 
list, for example, where PNG bitmaps (at sufficiently high resolution) 
seem to be the preferred method.

-thomas
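
A sketch of the high-resolution PNG route mentioned above (the file name
and settings are arbitrary choices, not from the thread):

png("figure-for-office.png", width = 6, height = 4, units = "in", res = 300)
plot(rnorm(100), main = "bitmap at 300 dpi for pasting into Office")
dev.off()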



Re: [R] Hardware for a new Workstation for best performance using R

2007-03-19 Thread Thomas Lumley
On Mon, 19 Mar 2007, Gabor Grothendieck wrote:

 On 3/19/07, Thomas Lumley [EMAIL PROTECTED] wrote:
 On Thu, 15 Mar 2007, Andrew Perrin wrote: (in part)

 2.) Yes, by all means you should use linux instead of windows. The
 graphics output is completely compatible with whatever applications you
 want to paste them into on Windows.

 This turns out not to be the case.

 It is not trivial to produce good graphics off Windows for adding to
 Microsoft Office documents (regrettably an important case for many
 people).  There has been much discussion of this on the R-sig-mac mailing
 list, for example, where PNG bitmaps (at sufficiently high resolution)
 seem to be the preferred method.

 On Windows one can produce metafile output directly from R.

Yes, indeed. However, this fact is of limited help when working on another 
operating system, which was the focus of the original question.

-thomas



Re: [R] Mac vs. PC

2007-03-10 Thread Thomas Lumley
On Fri, 9 Mar 2007, Richard Morey wrote:
 1. R on Mac was compiled with optimizations for the CPU, while R for
 Windows was not. I could test this by compiling R with the Intel
 compiler, or GCC with optimizations, and seeing if I get a significant
 speed boost.

Yes.  The Mac distribution uses Apple's linear algebra library, which is 
based on ATLAS and uses both cores.  The default Windows distribution 
doesn't use an optimized linear algebra library because there isn't one 
built in to Windows.  You can use ATLAS with the Windows distribution and 
there are even precompiled DLLs around somewhere.

 2. His R is 64 bit, while mine is for 32 bit windows. (I'm not sure how
 much of a difference that makes, or whether OSX is 64 bit.)

No.

His R isn't 64bit.  It would probably be slower if it were. The main 
reason to want 64bit R is to use lots of memory rather than to be fast.

 3. Data is getting swapped to the hard drive, and my hard drive is
 slower than his. I chose a slower hard drive to get bigger capacity for
 the price.

This could be true in principle, but I don't think the matrices are large 
enough for it to be the main factor.


His computer won't be twice as fast on most R tasks (though it will still 
be twice as pretty, of course).

-thomas
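
A quick way to compare linear-algebra speed between builds or BLAS
libraries (the matrix size is an arbitrary choice; timings vary by machine):

a <- matrix(rnorm(1000 * 1000), 1000)
system.time(crossprod(a))   # t(a) %*% a, dominated by the BLAS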



Re: [R] GLM: order of terms in model

2007-03-09 Thread Thomas Lumley

This is a FAQ

7.18 Why does the output from anova() depend on the order of factors in 
the model?

-thomas
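
A minimal sketch of the FAQ's point, with invented data in which T and A
are correlated so that the order of entry matters (the question being
answered is quoted below):

set.seed(1)
T <- rnorm(100)
A <- factor(ifelse(T + rnorm(100) > 0, "a", "b"))     # correlated with T
V <- T + (A == "a") + rnorm(100)
anova(glm(V ~ T + A, family = gaussian), test = "F")  # deviance goes to T first
anova(glm(V ~ A + T, family = gaussian), test = "F")  # deviance goes to A first
drop1(glm(V ~ T + A, family = gaussian), test = "F")  # each term tested last;
                                                      # order-independent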

On Fri, 9 Mar 2007, Christian Landry wrote:

 Dear R-helpers,

 I have been analysing data using a GLM. My model is as follows:

 mod <- glm(V ~ T + as.factor(A) + N, family=gaussian)

 and using

 anova(mod, test="F")

 to get the analysis of deviance table and the fraction of deviance
 explained by each term.

 T and A dominate with respect to their Deviance, with T having a
 larger effect than A (about twice)

 However, if I reverse T and A in the model, I get that A now explains
 more deviance than T.

 My questions are: 1) What is it due to?
 2) Is there any way around this? How do I find which model is best,
 and/or can I use another method that won't be sensitive to the order
 of the terms.

 Thanks,

 Christian

 Reply to: [EMAIL PROTECTED]





Re: [R] Memory error

2007-03-08 Thread Thomas Lumley
On Thu, 8 Mar 2007, Andrew Perrin wrote:

 Greetings-

 Running R 2.4.0 under Debian Linux, I am getting a memory error trying to
 read a very large file:

 library(foreign)
 oldgrades.df <- read.spss('Individual grades with AI (Nov 7 2006).sav',
   to.data.frame=TRUE)
 Error: cannot allocate vector of size 10826 Kb

Your file on disk seems to be about 300Mb, and it might well be larger in 
R, so it's probably too big for 32-bit R.

However, you could try to.data.frame=FALSE in the read.spss() call. Based 
on memory profiling of the fairly similar read.dta() function I would 
guess that as.data.frame.list() might well be the culprit.

-thomas



Re: [R] Mitools and lmer

2007-03-05 Thread Thomas Lumley

Doug,

It's mitools, not mltools. I wrote it.

I think the problem is just that coef() is not the right function for 
getting the fixed effects.  Beth wants
   betas <- MIextract(model0, fun=fixef)

-thomas
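
A hedged sketch of the corrected extraction, using model0 from the code
quoted below.  The as.matrix() coercion of vcov() is an untested guess at
a second fix (lmer returns an S4 matrix, which may be what MIcombine()
choked on), not something from the thread:

betas <- MIextract(model0, fun = fixef)
vars  <- MIextract(model0, fun = function(m) as.matrix(vcov(m)))
summary(MIcombine(betas, vars))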



On Sat, 3 Mar 2007, Douglas Bates wrote:

 On 3/2/07, Beth Gifford [EMAIL PROTECTED] wrote:
 Hey there
 I am estimating a multilevel model using lmer.  I have 5 imputed datasets so
 I am using mitools to pool the estimates from the 5
 datasets.  Everything seems to work until I try to use
 MIcombine to produced pooled estimates.  Does anyone have any suggestions?  
 The betas and the standard errors were extracted with no problem so 
 everything seems to work smoothly up until that point.

 I'm not familiar with the mltools package and I didn't see it listed
 in the CRAN packages.  Can you provide a reference or a link to the
 package?

 Program
 #Read data
 data.dir <- system.file("dta", package="mitools")
 files.imp <- imputationList(lapply(list.files(data.dir,
   pattern="imp.\\.dta", full=TRUE), read.dta))

 #estimate model over each imputed dataset
 model0 <- with(files.imp, lmer(erq2tnc ~ 1 + trt2 + nash + wash + male +
   coh2 + coh3 + (1 | sitebeth)))
 #extract betas and standard errors
 betas <- MIextract(model0, fun=coef)
 vars <- MIextract(model0, fun=vcov)
 #Combine the results
 summary(MIcombine(betas,vars))

 Error in cbar + results[[i]] : non-numeric argument to binary operator
 Error in summary(MIcombine(betas, vars)) :
 error in evaluating the argument 'object' in selecting a method for
 function 'summary'

 First use traceback() to discover where the (first) error occurred.
 My guess is that MIcombine expects a particular type of object for the
 vars argument and it is not getting that type (and not checking for
 the correct type).




 Thanks
 Beth






Re: [R] Timings of function execution in R [was Re: R in Industry]

2007-02-09 Thread Thomas Lumley

On 2/9/07, Prof Brian Ripley [EMAIL PROTECTED] wrote:
 The other reason why pmin/pmax are preferable to your functions is that
 they are fully generic.  It is not easy to write C code which takes into
 account that <, [, [<- and is.na are all generic.  That is not to say that
 it is not worth having faster restricted alternatives, as indeed we do
 with rep.int and seq.int.

 Anything that uses arithmetic is making strong assumptions about the
 inputs.  It ought to be possible to write a fast C version that worked for
 atomic vectors (logical, integer, real and character), but is there
 any evidence of profiled real problems where speed is an issue?


I had an example just last month of an MCMC calculation where profiling showed 
that pmax(x,0) was taking about 30% of the total time.  I used

 function(x) {z <- x < 0; x[z] <- 0; x}

which was significantly faster. I didn't try the arithmetic solution. Also, I 
didn't check if a solution like this would still be faster when both arguments 
are vectors (but there was a recent mailing list thread where someone else did).


  -thomas
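
A sketch for checking the claim on your own machine (vector length and
repetition count are arbitrary; timings vary by platform):

x <- rnorm(1e6)
clip0 <- function(x) { z <- x < 0; x[z] <- 0; x }
stopifnot(identical(pmax(x, 0), clip0(x)))   # same result
system.time(for (i in 1:50) pmax(x, 0))
system.time(for (i in 1:50) clip0(x))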



Re: [R] R in Industry

2007-02-07 Thread Thomas Lumley
On Tue, 6 Feb 2007, Muenchen, Robert A (Bob) wrote:

 That sounds like a good idea. The name R makes it especially hard to
 find job postings, resumes or do any other type of search. Googling
 resume+sas or job opening+sas is quick and fairly effective (less a
 few airline jobs). Doing that with R is of course futile. At the risk of
 getting flamed, it's too bad it's not called something more unique such
 as Rpackage, Rlanguage, etc.

For all sorts of reasons I don't think Googling for jobs using R was high 
on Ross & Robert's list of use cases when they chose the name ...

It might be better to have an archived list rather than a CRAN page -- 
I've just noticed that cran.us last updated on Jan 12, which would be a 
long delay for job ads.

-thomas



Re: [R] Two ways to deal with age in Cox model

2007-02-05 Thread Thomas Lumley
On Mon, 5 Feb 2007, John Sorkin wrote:

 When running a cox proportional hazards model ,there are two ways to
 deal with age,
 including age as a covariate, or to include age as part of the
 follow-up time, viz,
snip
 I would appreciate any thoughts about the differences in the
 interpretation of the two models.
 One obvious difference is that in the first model (fitagecovariate) one
 can make inferences about age and in the second one cannot. I think a
 second
 difference may be that in the first model the riskfactor is assumed to
 have values measured at the values of age, whereas in the second model
 riskfactor is assumed to have given values throughout the subject's
 life.

There are even more possibilities (a nice example and discussion is in 
Breslow & Day, the example being occupational exposure to nickel and 
later cancer).

The Cox model works by comparing covariates for the observation that 
failed and other observations at risk at the same time, so the comparisons 
are entirely within time-point.

If you use time since start of study you are comparing people with 
different covariates at the same time since start of study.

If you use calendar time you are comparing people with different 
covariates at the same calendar time.

If you use age you are comparing people with different covariates at the 
same age.


In an observational study it often is more important to control for age or 
for calendar time than for time since the study started, so these might be 
better time scales.  A disadvantage in some studies with longitudinal data 
is that on the study time scale everyone may have measurements at the same 
time but on other time scales everyone may have measurements at different 
times.


-thomas
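
A sketch of the two parameterizations with the survival package (data
invented; the counting-process form Surv(start, stop, event) puts age on
the time scale, with age at entry handled as left truncation):

library(survival)
set.seed(2)
n <- 200
entry_age  <- runif(n, 40, 70)
futime     <- rexp(n, 0.1)
status     <- rbinom(n, 1, 0.4)
riskfactor <- rbinom(n, 1, 0.5)
d <- data.frame(entry_age, exit_age = entry_age + futime,
                futime, status, riskfactor)
# time on study as the time scale, age as a covariate:
coxph(Surv(futime, status) ~ entry_age + riskfactor, data = d)
# age as the time scale, age absorbed into the baseline hazard:
coxph(Surv(entry_age, exit_age, status) ~ riskfactor, data = d)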



Re: [R] read.spss and encodings

2007-02-05 Thread Thomas Lumley


Thanks for the example file.  I'm trying to work out what, if anything, is 
easy to support reliably for encodings in SPSS.

-thomas
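
A guess at the conversion step the poster refers to below: reinterpreting
the one-byte labels as Greek text with iconv().  CP1253 as the source
encoding is an assumption based on the poster's windows-1253 remark:

x <- read.spss("1.sav")   # the first file from the message below
x$VAR1 <- iconv(as.character(x$VAR1), from = "CP1253", to = "UTF-8")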


On Sun, 4 Feb 2007, I. Soumpasis wrote:

 HI!

 This mail is related to Thomas mail so I follow up.

 I use the Greek language, and SPSS files with value labels containing Greek
 characters cannot be imported with read.spss.

 I am on:

 sessionInfo()
 R version 2.5.0 Under development (unstable) (2007-02-01 r40632)
 i686-pc-linux-gnu

 locale:
 LC_CTYPE=el_GR.UTF-8;LC_NUMERIC=C;LC_TIME=el_GR.UTF-8;LC_COLLATE=el_GR.UTF-8;LC_MONETARY=el_GR.UTF-8;LC_MESSAGES=el_GR.UTF-8;LC_PAPER=el_GR.UTF-8;LC_NAME=el_GR.UTF-8;LC_ADDRESS=el_GR.UTF-8;LC_TELEPHONE=el_GR.UTF-8;LC_MEASUREMENT=el_GR.UTF-8;LC_IDENTIFICATION=el_GR.UTF-8


 The following files are small examples used below:
 http://users.forthnet.gr/the/isoumpasis/data/1.sav
 http://users.forthnet.gr/the/isoumpasis/data/12.sav
 http://users.forthnet.gr/the/isoumpasis/data/12.RData

 The first file has english value labels and can be read:
 > read.spss("~/Desktop/1.sav")
 $VAR1
  [1] "\xf3\xf0\xdf\xf4\xe9"         "\xf3\xf0\xdf\xf4\xe9"
  [3] "\xf3\xf0\xdf\xf4\xe9"         "\xf3\xf0\xdf\xf4\xe9"
  [5] "\xf3\xf0\xdf\xf4\xe9"         "\xe3\xf1\xe1\xf6\xe5\xdf\xef"
  [7] "\xe3\xf1\xe1\xf6\xe5\xdf\xef" "\xe3\xf1\xe1\xf6\xe5\xdf\xef"
  [9] "\xe3\xf1\xe1\xf6\xe5\xdf\xef" "\xf3\xf0\xdf\xf4\xe9"
 [11] "\xe3\xf1\xe1\xf6\xe5\xdf\xef"

 $VAR2
 [1] 5 6 7 7 5 7 3 5 6 7 8

 attr(,"label.table")
 attr(,"label.table")$VAR1
 NULL

 attr(,"label.table")$VAR2
 NULL

 I can then convert the characters to greek using Thomas' code, so there is
 no problem here.

 In file 12.sav the value labels are greek. The problem is that the file
 cannot be read.

 > read.spss("~/Desktop/12.sav")
 Error in read.spss("~/Desktop/12.sav") : error reading system-file header
 In addition: Warning message:
 ~/Desktop/12.sav: position 0: Variable name begins with invalid character

 I also tried using use.value.labels=FALSE, with the same message.

 > read.spss("~/Desktop/12.sav", use.value.labels=FALSE)
 Error in read.spss("~/Desktop/12.sav", use.value.labels = FALSE) :
   error reading system-file header
 In addition: Warning message:
 ~/Desktop/12.sav: position 0: Variable name begins with invalid character

 The encoding of the spss files is windows-1253 (greek). The problem should
 be with other non-ascii characters too. Is there any workaround for this?

 Thanks in advance
 I.Soumpasis






Re: [R] Help with efficient double sum of max (X_i, Y_i) (X Y vectors)

2007-02-02 Thread Thomas Lumley
On Thu, 1 Feb 2007, Ravi Varadhan wrote:

 Jeff,

 Here is something which is a little faster:

 sum1 <- sum(outer(x, x, FUN=pmax))
 sum3 <- sum(outer(x, y, FUN=pmax))

This is the sort of problem where profiling can be useful.  My experience 
with pmax() is that it is surprisingly slow, presumably because it handles 
recycling and NAs.

In the example I profiled (an MCMC calculation) it was measurably faster 
to use
function(x, y) {i <- x < y; x[i] <- y[i]; x}

-thomas
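
A sketch combining the two suggestions, using the indexing version of
max() inside outer() (x and y invented; outer() expands both arguments to
equal-length vectors, so the elementwise function applies directly):

x <- rnorm(50); y <- rnorm(80)
fastmax <- function(a, b) { i <- a < b; a[i] <- b[i]; a }
sum1 <- sum(outer(x, x, fastmax))
sum2 <- sum(outer(y, y, fastmax))
sum3 <- sum(outer(x, y, fastmax))
stopifnot(all.equal(sum3, sum(outer(x, y, pmax))))  # agrees with pmax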


 Best,
 Ravi.

 
 ---

 Ravi Varadhan, Ph.D.

 Assistant Professor, The Center on Aging and Health

 Division of Geriatric Medicine and Gerontology

 Johns Hopkins University

 Ph: (410) 502-2619

 Fax: (410) 614-9625

 Email: [EMAIL PROTECTED]

 Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html



 
 

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Jeffrey Racine
 Sent: Thursday, February 01, 2007 1:18 PM
 To: r-help@stat.math.ethz.ch
 Subject: [R] Help with efficient double sum of max (X_i, Y_i) (X  Y
 vectors)

 Greetings.

 For R gurus this may be a no brainer, but I could not find pointers to
 efficient computation of this beast in past help files.

 Background - I wish to implement a Cramer-von Mises type test statistic
 which involves double sums of max(X_i,Y_j) where X and Y are vectors of
 differing length.

 I am currently using ifelse pointwise in a vector, but have a nagging
 suspicion that there is a more efficient way to do this. Basically, I
 require three sums:

 sum1: \sum_i\sum_j max(X_i,X_j)
 sum2: \sum_i\sum_j max(Y_i,Y_j)
 sum3: \sum_i\sum_j max(X_i,Y_j)

 Here is my current implementation - any pointers to more efficient
 computation greatly appreciated.

  nx <- length(x)
  ny <- length(y)

  sum1 <- 0
  sum3 <- 0

  for(i in 1:nx) {
    sum1 <- sum1 + sum(ifelse(x[i] > x, x[i], x))
    sum3 <- sum3 + sum(ifelse(x[i] > y, x[i], y))
  }

  sum2 <- 0
  sum4 <- sum3 # symmetric and identical

  for(i in 1:ny) {
    sum2 <- sum2 + sum(ifelse(y[i] > y, y[i], y))
  }

 Thanks in advance for your help.

 -- Jeff

 -- 
 Professor J. S. Racine Phone:  (905) 525 9140 x 23825
 Department of EconomicsFAX:(905) 521-8232
 McMaster Universitye-mail: [EMAIL PROTECTED]
 1280 Main St. W.,Hamilton, URL:
 http://www.economics.mcmaster.ca/racine/
 Ontario, Canada. L8S 4M4

 `The generation of random numbers is too important to be left to chance'





Re: [R] read.spss and encodings

2007-02-01 Thread Thomas Lumley
On Thu, 1 Feb 2007, Thomas Friedrichsmeier wrote:

 Hi!

 I'm having trouble with importing spss files containing non-ascii characters

Peter has explained what is going on.  It would be ideal for read.spss() 
to do the translation to the current locale. This would require knowing 
what encoding the SPSS file is using.  I think it is always a one-byte 
encoding and in your case it is apparently Latin-1, but I don't know if 
this is always the case, or how to tell which encoding it uses.

-thomas



Re: [R] How to get correct integration in C for step function?

2007-01-22 Thread Thomas Lumley
On Sun, 21 Jan 2007, Lynette wrote:

 Dear all,

 I am using Rdqags in C to realize the integration. It seems that for a
 continuous C function I can get correct results. However, for step functions,
 the results are not correct. For example, the following one, when integrated
 from 0 to 1, gives 1 instead of the correct 1.5


Using integrate() in R for an R-defined step function gives the right 
answer (eg on the example in ?ecdf).

This suggests a problem in your C code, since integrate() just calls 
dqags.

-thomas



Re: [R] [ESS] How to get correct integration in C for step function?

2007-01-22 Thread Thomas Lumley
On Mon, 22 Jan 2007, Lynette wrote:

 Well,

 I have no idea either. I can get correct answers for continuous functions but
 incorrect ones for step functions.


I have just tried using Rdqags from C for the function x > 0 and it worked 
fine (once I had declared all the arguments correctly). The code is below.

-thomas





#include <Rinternals.h>
#include <R_ext/Applic.h>

double stepfn(double x){
return (x > 0);
}


SEXP call_stepfn(SEXP x){
SEXP answer;
int i,n;

n=length(x);
PROTECT(answer=allocVector(REALSXP,n));
for(i=0; i<n; i++){
REAL(answer)[i]=stepfn(REAL(x)[i]);
}
UNPROTECT(1);
return answer;
}


void vec_stepfn(double *x, int n, void *ex){
int i;
for (i=0; i<n; i++) x[i]=stepfn(x[i]);
return;
}

void Cvec_stepfn(double *x, double *n){

  vec_stepfn(x, *n, (void *) NULL);
  return;
}

SEXP int_stepfn(SEXP lower, SEXP upper){

SEXP answer; 
double result, abserr;
int last, neval, ier;
int lenw;
int *iwork;
double *work;
int limit=100;
double reltol=0.1;
double abstol=0.1;

  lenw = 4 * limit;
  iwork =   (int *) R_alloc(limit, sizeof(int));
  work = (double *) R_alloc(lenw,  sizeof(double));

  Rdqags(vec_stepfn, (void *) NULL, REAL(lower), REAL(upper),
         &abstol, &reltol,
         &result, &abserr, &neval, &ier,
         &limit, &lenw, &last,
         iwork, work);

printf("%f %f %d\n", result, abserr, ier);

PROTECT(answer=allocVector(REALSXP,1));
REAL(answer)[0]=result;
UNPROTECT(1);

return answer;
}



Re: [R] How to get correct integration in C for step function?

2007-01-22 Thread Thomas Lumley
On Mon, 22 Jan 2007, Lynette wrote:

 Dear all, especially to Thomas,

 I have figured out the problem. For the step function, something was wrong in 
 my C code: I should use the expression ((x>=0.25)&&(x<=0.75)) ? 2 : 1 instead 
 of ((x>=1/4)&&(x<=3/4)) ? 2 : 1. I have no idea why 0.25 makes a difference 
 from 1/4 in C. But now I can go ahead with the correct integration in C. Thank 
 you all. And hope this helps others.


1 and 4 are ints in C, so 1/4 is 0 (integer division).

-thomas



Re: [R] integrate and quadratic forms

2007-01-19 Thread Thomas Lumley

As the documentation for integrate() says, the function must be vectorized

   f: an R function taking  a numeric first argument and returning a
   numeric vector of the same length.

so you can't use sum(). You need matrix operations or an explicit loop to 
add up the terms.


-thomas
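
The quickest repair of the code quoted below may be to keep the scalar
sum()/prod() version but wrap it in Vectorize(), which loops over u so
that integrate() gets back a vector (a sketch; parameters are those given
in the message):

lambda <- c(.6, .3, .1); h <- c(2, 2, 2); sigma <- c(0, 0, 0); q <- 3
integrand <- function(u) {       # scalar version of integral1 below
  o   <- (1/2) * sum(h * atan(lambda * u) +
                     sigma^2 * lambda * u / (1 + lambda^2 * u^2)) - q * u / 2
  rho <- prod((1 + lambda^2 * u^2)^(h/4)) *
         exp((1/2) * sum((sigma * lambda * u)^2 / (1 + lambda^2 * u^2)))
  sin(o) / (u * rho)
}
1/2 + (1/pi) * integrate(Vectorize(integrand), 0, Inf)$value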




On Thu, 18 Jan 2007, rdporto1 wrote:

 Hi all.

 I'm trying to numerically invert the characteristic function
 of a quadratic form following Imhof's (1961, Biometrika 48)
 procedure.

 The parameters are:

 lambda=c(.6,.3,.1)
 h=c(2,2,2)
 sigma=c(0,0,0)
 q=3

 I've implemented Imhof's procedure two ways that, for me,
 should give the same answer:

 #more legible
 integral1 = function(u) {
  o=(1/2)*sum(h*atan(lambda*u)+sigma^2*lambda*u/(1+lambda^2*u^2)) - q*u/2
  rho=prod((1+lambda^2*u^2)^(h/4))*exp( 
 (1/2)*sum((sigma*lambda*u)^2/(1+lambda^2*u^2)) )
  integrand = sin(o)/(u*rho)
 }

 #same as above
 integral2= function(u) {
 ((1/2)*sum(h*atan(lambda*u)+sigma^2*lambda*u/(1+lambda^2*u^2)) - q*u/2)/
 (u*(prod((1+lambda^2*u^2)^(h/4))*
 exp( (1/2)*sum((sigma*lambda*u)^2/(1+lambda^2*u^2)) )))
 }

 The following should be near 0.18. However, neither answer is near this
 value, nor do the two agree with each other!

 > 1/2+(1/pi)*integrate(integral1,0,Inf)$value
 [1] 1.022537
 > 1/2+(1/pi)*integrate(integral2,0,Inf)$value
 [1] 1.442720

 What's happening? Is this a bug or OS specific? Shouldn't they give the
 same answer? Why do I get results so different from 0.18? In time:
 the procedure works fine for q=.2.

 I'm running R 2.4.1 in a PC with Windows XP 32bits. Other ways (in R) to
 find the distribution of general quadratic forms are welcome.

 Thanks in advance.

 Rogerio.





Re: [R] Is it the PPS samples i needed in R?

2007-01-16 Thread Thomas Lumley
On Fri, 12 Jan 2007, zhijie zhang wrote:

 Dear friends,
 I want to do an unequal probability sampling, that is, Probability
 Proportionate to Size. Is the following program right?
 Say my original dataset is:

 ID  Population
 1 100
 2 200
 3 300
 If the population is large, then the corresponding ID has a large
 probability of being selected.

 sample(A$ID, size=2, replace = FALSE, prob = A$population)
 #suppose the dataset name is A.
 Are these the PPS samples I need?

No, this does not give PPS samples for size > 1.  The pps and sampling 
packages have code for PPS samples.

-thomas
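
A sketch with the sampling package, on the three-row example above
(inclusionprobabilities() scales the size variable to inclusion
probabilities for n = 2; UPsystematic() is one of that package's several
UP* unequal-probability designs):

library(sampling)
A <- data.frame(ID = 1:3, population = c(100, 200, 300))
pik <- inclusionprobabilities(A$population, 2)
s   <- UPsystematic(pik)   # 0/1 indicator of the selected units
A$ID[s == 1]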



Re: [R] SAS and R code hazard ratios

2007-01-10 Thread Thomas Lumley
On Wed, 10 Jan 2007, [EMAIL PROTECTED] wrote:
 I am new to R and have been comparing CPH survival analysis hazard ratios
 between R and SAS PhReg.  The binary covariates' HRs are the same, however
 the continuous variables, for example age, have quite different HRs
 although in the same direction.  SAS PhReg produces HRs which are the
 change in risk for every one increment change in the independent variable.
 How do I interpret the HRs produced by R?  Thanks much, C


What function did you use to fit the model in R?  If you used coxph() in 
the survival package, then you should get the same answers as SAS. If you 
used cph() in the Design package, then the output says what the increment 
is that corresponds to the quoted hazard ratio, and the default is the 
interquartile range.

-thomas



Re: [R] memory problem

2007-01-08 Thread Thomas Lumley
On Sat, 6 Jan 2007, Zoltan Kmetty wrote:

 Hi!

 I had a memory problem with R - I hope somebody could tell me a solution.

 I work with very large datasets, but R cannot allocate enough memory to
 handle these datasets.

You haven't said what you want to do with these datasets.

-thomas



Re: [R] eval(parse(text vs. get when accessing a function

2007-01-05 Thread Thomas Lumley
On Fri, 5 Jan 2007, Ramon Diaz-Uriarte wrote:

 I see, this is a direct way of dealing with the problem. However, you first need
 to build the f list, and you might not know about that ahead of time. For
 instance, if I build a function g so that the only thing you need to do to
 use it is to call your function f.something, and then pass the something.

 I am still under the impression that, given your answer,
 using eval(parse(text is not your preferred way.  What are the possible
 problems (if there are any, that is). I guess I am puzzled by rethink
 whether that was really the right question.


There are definitely situations where parse() is necessary or convenient,
or we wouldn't provide it. For example, there are some formula-manipulation 
problems where it really does seem to be the best solution.

The point of my observation was that it is relatively common for people to ask 
about parse() solutions to problems, but relatively rare to see them in code by 
experienced R programmers.  The 'rethink the question' point is that a 
narrowly-posed programming problem may suggest parse() as the answer, when 
thinking more broadly about what you are trying to do may allow a completely 
different approach [the example of lists is a common one].

The problem with eval(parse()) is not primarily one of speed.  A problem with 
parse() is that manipulating text strings is easy to mess up, since text has so 
much less structure than code. A problem with eval() is that it is too powerful 
-- since it can do anything, it is harder to keep track of what it is doing.

In one sense this is just a style issue, but I still think my comment is good 
advice. If you find yourself wanting to use parse() it is a good idea to stop 
and think about whether there is a better way to do it. Often, there is. 
Sometimes, there isn't.


-thomas
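
An illustration of the styles being contrasted (function and variable
names invented):

f.square <- function(x) x^2
fname <- "square"
eval(parse(text = paste("f.", fname, "(3)", sep = "")))  # string building
get(paste("f.", fname, sep = ""))(3)                     # lookup by name
flist <- list(square = function(x) x^2)
flist[[fname]](3)                                        # keep them in a list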



Re: [R] understanding integer divide (%/%)

2007-01-03 Thread Thomas Lumley
On Wed, 3 Jan 2007, ONKELINX, Thierry wrote:

 This is due to the internal representation of 0.1, which is not exactly
 0.1 but very close to it. If you want to do an integer divide, you
 should only use integers to divide with.

This must be more-or-less correct, but it is worth noting that
> 0.1*10==1
[1] TRUE
> 1/0.1==10
[1] TRUE
> 1%/%0.1==10
[1] FALSE
so it isn't quite that simple.

Interestingly, the results seem to vary by system -- on a G4 Mac I get
1 %/% (1/x) == x for all x from 1 to 50

-thomas
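
An illustration of the underlying representation (the printed digits may
differ slightly between platforms):

sprintf("%.20f", 0.1)
## [1] "0.10000000000000000555"
# whether the intermediate division in %/% lands just above or just below
# 10 before floor() is taken depends on the platform's arithmetic, hence
# the differing results above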



 Cheers,

 Thierry

 
 

 ir. Thierry Onkelinx

 Instituut voor natuur- en bosonderzoek / Reseach Institute for Nature
 and Forest

 Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
 methodology and quality assurance

 Gaverstraat 4

 9500 Geraardsbergen

 Belgium

 tel. + 32 54/436 185

 [EMAIL PROTECTED]

 www.inbo.be



 Do not put your faith in what statistics say until you have carefully
 considered what they do not say.  ~William W. Watt

 A statistical analysis, properly conducted, is a delicate dissection of
 uncertainties, a surgery of suppositions. ~M.J.Moroney

 -Oorspronkelijk bericht-
 Van: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Namens Jeffrey Prisbrey
 Verzonden: woensdag 3 januari 2007 14:21
 Aan: r-help@stat.math.ethz.ch
 Onderwerp: [R] understanding integer divide (%/%)

 I am confused about why the following occurs:

 version
   _
 platform   i386-pc-mingw32
 arch   i386
 os mingw32
 system i386, mingw32
 status
 major  2
 minor  4.0
 year   2006
 month  10
 day03
 svn rev39566
 language   R
 version.string R version 2.4.0 (2006-10-03)
 > 1 %/% 0.1
 [1] 9
 > 10 %/% 1
 [1] 10


 This effect led me into a trap when I tried to
 classify a set of proportions based on the first
 decimal place by integer dividing by 0.1.  Can someone
 explain why this behavior occurs and give me an
 insight into how to predict it?

 Thanks,
 -- Jeff





Re: [R] Any container in R?

2007-01-01 Thread Thomas Lumley
On Mon, 1 Jan 2007, Feng Qiu wrote:

 Hi Duncan:
 Thank you very much! I checked out unique(), it does exactly what I
 want. But I'm still curious whether R provides an STL (standard template
 library).

No.

Some things the STL does aren't needed in R, others are implemented 
differently, and others aren't implemented.

One particularly important example is iterators, which will often either 
happen invisibly due to vectorized operations or will be done with the 
*apply family of functions.

Your example could have been done either way. Using duplicated() is the 
vectorized approach; the apply approach would use tapply().

C++ is not terribly similar to R. A lot of the effort in STL is expended 
on allowing a piece of code to be used on different types (where 
appropriate). In R you have to expend effort on stopping a piece of code 
being used on different types (where inappropriate).


-thomas
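
A toy illustration of the two styles (data invented; the original poster's
example is not shown in the thread):

x <- c(3, 1, 3, 2, 1)
x[!duplicated(x)]             # vectorized: unique values, in first-seen order
tapply(seq_along(x), x, min)  # apply family: index of first occurrence per value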



Re: [R] LU bug in Matrix package

2006-12-28 Thread Thomas Lumley
On Thu, 28 Dec 2006, yangguoyi.ou wrote:

 There is a bug in Matrix package, please check it, thanks!

You didn't say what R code you ran to get that output or why you think it 
is wrong.

Let us experiment to see if we can guess what the problem is from the 
limited information provided
> x <- t(Matrix(1:25, 5))
> x
5 x 5 Matrix of class "dgeMatrix"
  [,1] [,2] [,3] [,4] [,5]
[1,]12345
[2,]6789   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

> lux <- lu(x)
Warning message:
Exact singularity detected during LU decomposition.

Now check that the decomposition is correct
> with(expand(lux), P %*% L %*% U)
5 x 5 Matrix of class "dgeMatrix"
  [,1] [,2] [,3] [,4] [,5]
[1,]12345
[2,]6789   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
[5,]   21   22   23   24   25

Ok, it is correct.
Now let's look at the matrices:

> expand(lux)
$L
5 x 5 Matrix of class "dtrMatrix"
  [,1]   [,2]   [,3]   [,4]   [,5]
[1,] 1.  .  .  .  .
[2,] 0.04761905 1.  .  .  .
[3,] 0.52380952 0.5000 1.  .  .
[4,] 0.28571429 0.7500 0.35400130 1.  .
[5,] 0.76190476 0.2500 0.5000 0. 1.

$U
5 x 5 Matrix of class "dtrMatrix"
  [,1]  [,2]  [,3]  [,4]  [,5]
[1,]  2.10e+01  2.20e+01  2.30e+01  2.40e+01  2.50e+01
[2,] .  9.523810e-01  1.904762e+00  2.857143e+00  3.809524e+00
[3,] . . -1.919092e-15 -3.711332e-15 -5.614541e-15
[4,] . . .  3.781500e-16  6.288326e-16
[5,] . . . .  0.00e+00

$P
5 x 5 sparse Matrix of class "pMatrix"

[1,] . 1 . . .
[2,] . . . 1 .
[3,] . . 1 . .
[4,] . . . . 1
[5,] 1 . . . .


Perhaps the output was a trimmed version of this? The L and U matrices 
agree, but you didn't show the permutation matrix.

Did you just miss the fact that the decomposition is P%*%L%*%U? If you 
aren't used to S4 classes it is easy to miss the fact that class?LU 
gives help on the structure of the LU class, and if you try to guess 
what the components are without the documentation it is easy to be 
confused.


-thomas








 Matlab result:

 x =

      1     2     3     4     5
      6     7     8     9    10
     11    12    13    14    15
     16    17    18    19    20
     21    22    23    24    25

 lu(x)

 ans =

    21.0000   22.0000   23.0000   24.0000   25.0000
     0.0476    0.9524    1.9048    2.8571    3.8095
     0.7619    0.2500   -0.0000   -0.0000   -0.0000
     0.5238    0.5000    0.4839         0    0.0000
     0.2857    0.7500   -0.0645         0    0.0000

 Gsl result:

 21 22 23 24 25
 0.047619 0.952381 1.90476 2.85714 3.80952
 0.761905 0.25 -3.44169e-015 -6.88338e-015 -1.03251e-014
 0.52381 0.5 -0.0322581 -1.77636e-015 -1.66533e-015
 0.285714 0.75 -0.0645161 -0 2.22045e-016

 R result:

 $L
 5 x 5 Matrix of class "dtrMatrix"
      [,1]       [,2]       [,3]       [,4]       [,5]
 [1,] 1.  .  .  .  .
 [2,] 0.04761905 1.  .  .  .
 [3,] 0.52380952 0.5000 1.  .  .
 [4,] 0.28571429 0.7500 0.35400130 1.  .
 [5,] 0.76190476 0.2500 0.5000 0. 1.

 $U
 5 x 5 Matrix of class "dtrMatrix"
      [,1]          [,2]          [,3]          [,4]          [,5]
 [1,]  2.10e+01  2.20e+01  2.30e+01  2.40e+01  2.50e+01
 [2,] .  9.523810e-01  1.904762e+00  2.857143e+00  3.809524e+00
 [3,] . . -1.919092e-15 -3.711332e-15 -5.614541e-15
 [4,] . . .  3.781500e-16  6.288326e-16
 [5,] . . . .  0.00e+00





Re: [R] Memory problem on a linux cluster using a large data set [Broadcast]

2006-12-21 Thread Thomas Lumley
On Thu, 21 Dec 2006, Iris Kolder wrote:

 Thank you all for your help!

 So with all your suggestions we will try to run it on a computer with a 
 64 bits proccesor. But i've been told that the new R versions all work 
 on a 32bits processor. I read in other posts that only the old R 
 versions were capable of larger data sets and were running under 64 bit 
 proccesors. I also read that they are adapting the new R version for 64 
 bits proccesors again so does anyone now if there is a version available 
 that we could use?

Huh?  R 2.4.x runs perfectly happily accessing large memory under Linux on 
64bit processors (and Solaris, and probably others). I think it even works 
on Mac OS X now.

For example:
> x <- rnorm(1e9)
> gc()
  used   (Mb) gc trigger   (Mb)   max used   (Mb)
Ncells 222881   12.0 467875   25.0 35   18.7
Vcells 1000115046 7630.3 1000475743 7633.1 1000115558 7630.3


 -thomas



Re: [R] attach and object masking

2006-12-19 Thread Thomas Lumley
On Tue, 19 Dec 2006, DEEPANKAR BASU wrote:

 Hi R users!

 I am new to R. When I try to attach a simple dataset using the attach() 
 command, I get the following message:

 > attach(data1)

The following object(s) are masked from package:base :

 write

 Can someone tell me what this means? (`write' is the name of a variable 
 in the dataset). And, do I need to do something about this.


It means that 'write' is the name of a variable in the dataset. R is 
warning you that you have two things called 'write' -- your variable and a 
function in the base package.

It also means that you have missed at least three upgrades of R (the 
fourth is just out), since in version 2.3.0 and more recent you don't get 
the warning when a variable and a function have the same name, only for 
two variables or two functions.  There have been quite a lot of other 
changes since your version of R, so it would be worth upgrading.

-thomas



Re: [R] attach and object masking

2006-12-19 Thread Thomas Lumley
On Tue, 19 Dec 2006, Thomas Lumley wrote:

 It also means that you have missed at least three upgrades of R (the
 fourth is just out) since in version 2.3.0 and more recent you don't get
 the warning when a variable and a function have the same name, only for
 two variables or two functions.  There have been quite a lot of other
 changes since your version of R, so it would be worth upgrading.


I just noticed I didn't say explicitly whether you need to do anything 
else about the warning. You don't.

-thomas



Re: [R] Upgrading

2006-12-19 Thread Thomas Lumley
On Tue, 19 Dec 2006, DEEPANKAR BASU wrote:

 Hi!

 As per Thomas' advice, I upgraded R by using update.packages() and got 
 the following warning messages:

That was not my advice on how to upgrade. update.packages() updates the 
packages. You need to download a new version of R itself.  You will then 
need to update or reinstall the packages. The warning messages are because 
you are updating to versions of the packages that do not run on your old 
version of R.

-thomas



 Warning messages:
 1: installation of package 'lmtest' had non-zero exit status in: 
 install.packages(update[, Package], instlib, contriburl = contriburl,
 2: installation of package 'quadprog' had non-zero exit status in: 
 install.packages(update[, Package], instlib, contriburl = contriburl,
 3: installation of package 'cluster' had non-zero exit status in: 
 install.packages(update[, Package], instlib, contriburl = contriburl,
 4: installation of package 'tseries' had non-zero exit status in: 
 install.packages(update[, Package], instlib, contriburl = contriburl,

 Do I need to worry about these messages? Do I need to do something else to 
 complete the upgrade process?

 Another question: what is the command for renaming an existing variable?

 Thanks.
 Deepankar





  1   2   3   4   5   6   7   8   9   10   >