[R] How can I overwrite a method in R?

2014-10-09 Thread Tim Hesterberg
How can I create an improved version of a method in R, and have it be used?

Short version:
I think plot.histogram has a bug, and I'd like to try a version with a fix.
But when I call hist(), my fixed version doesn't get used.

Long version:
hist() calls plot() which calls plot.histogram() which fails to pass ...
when it calls plot.window().
As a result hist() ignores xaxs and yaxs arguments.
I'd like to make my own copy of plot.histogram that passes ... to
plot.window().

If I just make my own copy of plot.histogram, plot() ignores it, because my
version is not part of the same graphics package that plot belongs to.

If I copy hist, hist.default and plot, the copies inherit the same
environments as the originals, and behave the same.

If I also change the environment of each to .GlobalEnv, hist.default
fails in a .Call because it cannot find C_BinCount.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How can I overwrite a method in R?

2014-10-09 Thread Tim Hesterberg
Thank you Duncan, Brian, Hadley, and Lin.

In Lin's suggestion, I believe the latter two statements should be
reversed, so that the environment is added before the function is placed
into the graphics namespace.

source('plot.histogram.R')
environment(plot.histogram) <- asNamespace('graphics')
assignInNamespace('plot.histogram', plot.histogram, ns='graphics')

The middle statement could also be
environment(plot.histogram) <- environment(graphics:::plot.histogram)
The point is to ensure that the replacement version has the same
environment as the original.

Having tested this, I will now submit a bug report :-)

On Thu, Oct 9, 2014 at 9:11 AM, C Lin bac...@hotmail.com wrote:

 I posted a similar question a while ago. Search for "modify function in a
 package".
 In your case, the following should work.

 source('plot.histogram.R')
 assignInNamespace('plot.histogram',plot.histogram,ns='graphics')
 environment(plot.histogram) <- asNamespace('graphics');

 This assumes you have your own plot.histogram function inside
 plot.histogram.R, and that the plot.histogram function you are trying to
 overwrite is in the graphics package.

 Lin

 
  From: h.wick...@gmail.com
  Date: Thu, 9 Oct 2014 07:00:31 -0500
  To: timhesterb...@gmail.com
  CC: r-h...@stat.math.ethz.ch
  Subject: Re: [R] How can I overwrite a method in R?
 
  This is usually ill-advised, but I think it's the right solution for
  your problem:
 
  assignInNamespace("plot.histogram", function(...) plot(1:10), "graphics")
  hist(1:10)
 
  Hadley
 
  On Thu, Oct 9, 2014 at 1:14 AM, Tim Hesterberg timhesterb...@gmail.com
 wrote:
  How can I create an improved version of a method in R, and have it be
 used?
 
  Short version:
  I think plot.histogram has a bug, and I'd like to try a version with a fix.
  But when I call hist(), my fixed version doesn't get used.
 
  Long version:
  hist() calls plot() which calls plot.histogram() which fails to pass ...
  when it calls plot.window().
  As a result hist() ignores xaxs and yaxs arguments.
  I'd like to make my own copy of plot.histogram that passes ... to
  plot.window().
 
  If I just make my own copy of plot.histogram, plot() ignores it, because
  my version is not part of the same graphics package that plot belongs to.

  If I copy hist, hist.default and plot, the copies inherit the same
  environments as the originals, and behave the same.

  If I also change the environment of each to .GlobalEnv, hist.default
  fails in a .Call because it cannot find C_BinCount.
 
 
 
 
  --
  http://had.co.nz/
 





-- 
Tim Hesterberg
http://www.timhesterberg.net
 (resampling, water bottle rockets, computers to Costa Rica, hot shower =
2650 light bulbs, ...)

Help your students understand statistics:
Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.timhesterberg.net/bootstrap/




Re: [R] boot() with glm/gnm on a contingency table

2012-09-12 Thread Tim Hesterberg
One approach is to bootstrap the vector 1:n, where n is the number
of individuals, with a function that does:
f <- function(vectorOfIndices, theTable) {
  (1) create a new table with the same dimensions, but with the counts
  in the table based on vectorOfIndices.
  (2) Calculate the statistics of interest on the new table.
}

When f is called with 1:n, the table it creates should be the same
as the original table.  When called with a bootstrap sample of
values from 1:n, it should create a table corresponding to the
bootstrap sample.
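A minimal sketch of such a function, with step (2) left as a placeholder (the names and the toy return value are illustrative, not the poster's code):

```r
# Sketch: rebuild a contingency table from a vector of individual
# indices, then compute a statistic on it. Here we return the rebuilt
# table itself; substitute your real statistic for step (2).
f <- function(vectorOfIndices, theTable) {
  # cell membership of each individual in the original table
  cellOfIndividual <- rep(seq_along(theTable), as.vector(theTable))
  newTable <- theTable
  newTable[] <- tabulate(cellOfIndividual[vectorOfIndices],
                         nbins = length(theTable))
  newTable  # step (2) would compute the statistic of interest here
}
```

Called with 1:n, this reproduces the original table exactly, as described above.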

Tim Hesterberg
http://www.timhesterberg.net
 (resampling, water bottle rockets, computers to Costa Rica, shower = 2650 
light bulbs, ...)

NEW!  Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8

Hi everyone!

In a package I'm developing, I have created a custom function to get
jackknife standard errors for the parameters of a gnm model (which is
essentially the same as a glm model for this issue). I'd like to add
support for bootstrap using package boot, but I couldn't find how to
proceed.

The problem is, my data is a table object. Thus, I don't have one
individual per line: when the object is converted to a data frame, one
row represents one cell, or one combination of factor levels. I cannot
pass this to boot() as the data argument and use indices from my
custom statistic() function, since I would drop cells, not individual
observations.

A very inefficient solution would be to create a data frame with one row
per observation, by replicating each cell using its frequencies. That's
really a brute force solution, though. ;-)

The other way would be to generate importance weights based on observed
frequencies, and to multiply the original data by the weights at each
iteration, but I'm not sure that's correct. Thoughts?


Thanks for your help!



Re: [R] boot() with glm/gnm on a contingency table

2012-09-12 Thread Tim Hesterberg
On Wednesday, 12 September 2012 at 07:08 -0700, Tim Hesterberg wrote:
 One approach is to bootstrap the vector 1:n, where n is the number
 of individuals, with a function that does:
 f <- function(vectorOfIndices, theTable) {
   (1) create a new table with the same dimensions, but with the counts
   in the table based on vectorOfIndices.
   (2) Calculate the statistics of interest on the new table.
 }

 When f is called with 1:n, the table it creates should be the same
 as the original table.  When called with a bootstrap sample of
 values from 1:n, it should create a table corresponding to the
 bootstrap sample.
Indeed, that's another solution I considered, but I wanted to be sure
nothing more reasonable exists. You're right that it's more efficient
than replicating the whole data set. But still, with a typical table of
fewer than 100 cells and several thousand observations, this means
creating a potentially long vector, much larger than the original data;
nothing really hard for common machines, to be sure.

If no other way exists, I'll use this. Thanks.

In your original posting you also suggested:
The other way would be to generate importance weights based on observed
frequencies, and to multiply the original data by the weights at each
iteration, but I'm not sure that's correct. Thoughts?

You could do:

bootstrapTable <- x  # where x is the original table
replicates <- numeric(numberOfBootstrapSamples)
for(i in 1:numberOfBootstrapSamples) {
  bootstrapTable[] <- rmultinom(1, size = sum(x), prob = x)
  replicates[i] <- myFunction(bootstrapTable)
}
# caveat - not tested


I can't tell from help(boot) whether you could do it correctly there.
boot has a 'weights' argument that you could use for the sampling
probabilities, but you also need a way to tell it to draw sum(x)
observations.  Or, you could also pass boot a parametric sampler.  
But be careful if you use boot in either of these ways; you not only
need to generate the bootstrap samples, you also need to make sure
that it does all other calculations correctly, including
calculating the statistic for the original data, calculating jackknife
statistics if they are used for confidence intervals, etc.
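For the parametric route, boot's `sim = "parametric"` and `ran.gen` arguments fit the multinomial idea above. A sketch with toy data; `x` and `myFunction` here are illustrative stand-ins for the real table and statistic:

```r
# Sketch only: parametric bootstrap of a contingency table via boot(),
# drawing each resampled table from a multinomial with the observed
# counts as probabilities.
library(boot)
set.seed(1)
x <- as.table(matrix(c(10, 20, 30, 40), 2))            # toy contingency table
myFunction <- function(tab) chisq.test(tab)$statistic  # placeholder statistic
ran.gen.table <- function(data, mle) {
  data[] <- rmultinom(1, size = sum(data), prob = data)
  data
}
out <- boot(x, statistic = myFunction, R = 199,
            sim = "parametric", ran.gen = ran.gen.table)
sd(out$t)  # bootstrap standard error of the statistic
```

As noted above, check carefully that the package's other calculations (t0, intervals) behave sensibly when used this way.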

Wistful sigh - this would be pretty easy to do with S+Resample.

Tim Hesterberg



Re: [R] Histogram to KDE

2012-09-06 Thread Tim Hesterberg
To bootstrap from a histogram, use
  sample(bins, size = sum(counts), replace = TRUE, prob = counts)

Note that a kernel density estimate is biased, so some bootstrap
confidence intervals have poor coverage properties.
Furthermore, if the kernel bandwidth is data-driven then the estimate
is not functional, so some bootstrap and jackknife methods won't work right.

Tim Hesterberg
http://www.timhesterberg.net
New:  Mathematical Statistics with Resampling and R, Chihara & Hesterberg

On Fri, Aug 31, 2012 at 12:15 PM, David L Carlson dcarl...@tamu.edu wrote:

 Using a data.frame x with columns bins and counts:

 x <- structure(list(bins = c(3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5,
 11.5, 12.5, 13.5, 14.5, 15.5), counts = c(1, 1, 2, 3, 6, 18,
 19, 23, 8, 10, 6, 2, 1)), .Names = c("bins", "counts"),
 row.names = 4:16, class = "data.frame")

 This will give you a plot of the kde estimate:


Thanks.


 xkde <- density(rep(x$bins, x$counts), bw = "SJ")
 plot(xkde)

 As for the standard error or the confidence interval, you would probably
 need to use bootstrapping.



 On a similar note - is there a way in R to directly resample /
cross-validate from a histogram of a data-set without recreating the
original data-set?


   -Original Message-
 
  Hello,
  I wanted to know if there was a way to convert a histogram of a data-set
  to a kernel density estimate directly in R?

  Specifically, I have a histogram [bins, counts] of samples {X1 ...
  XN} of a quantized variable X where there is one bin for each level of
  X, and I'd like to directly get a kde estimate of the pdf of X from the
  histogram. Therefore, there is no additional quantization of X in the
  histogram. Most KDE methods in R seem to require the original sample
  set - and I would like to avoid re-creating the samples from the
  histogram. Is there some quick way of doing this using one of the
  standard kde methods in R?
 
  Also, a general statistical question - is there some measure of the
  standard error or confidence interval or similar of a KDE of a data-set?
 
  Thanks,
  -fj
 





[R] [R-pkgs] dataframe package

2012-05-23 Thread Tim Hesterberg
A new 'dataframe' package is on CRAN.  This is a modified version
of the data frame code in R, modified to make fewer copies of the inputs
and run faster, e.g.
  as.data.frame(a vector)
makes 1 copy of the vector rather than 3,
  data.frame(a vector)
makes 3 copies rather than 6, and
  x[, "a"] <- newValue
makes (2,2,1) copies of (old data frame, new column, integer vector
of row names) rather than (6,4,2).

Functionality should be identical to existing R code, with two exceptions:
* bug fix: if x has two columns, then x[[4]] <- value will fail rather than
  producing an illegal data frame
* round(a data frame with numeric and factor columns)
  rounds the numeric columns and leaves the factor columns unchanged, rather
  than failing.

Tim Hesterberg

NEW!  Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
http://home.comcast.net/~timhesterberg
 (resampling, water bottle rockets, computers to Costa Rica, shower = 2650 
light bulbs, ...)

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] Permutation or Bootstrap to obtain p-value for one sample

2011-10-09 Thread Tim Hesterberg
I'll concur with Peter Dalgaard that 
* a permutation test is the right thing to do - your problem is equivalent
  to a two-sample test,
* don't bootstrap, and
* don't bother with t-statistics
but I'll elaborate a bit on why, including
* two approaches to the whole problem - and how your approach relates
  to the usual approach,
* an interesting tidbit about resampling t statistics.

First, I'm assuming that your x variable is irrelevant, only y matters,
and that sample1 is a proper subset of df.  I'll also assume that you
want to look for differences in the mean, rather than arbitrary differences
(in which case you might use e.g. a Kolmogorov-Smirnov test).

There are two general approaches to this problem:
(1) two-sample problem, sample1$y vs df$y[rows other than sample 1]
(2) the approach you outlined, thinking of sample1$y as part of df$y.

Consider (1), and call the two data sets y1 and y2
The basic permutation test approach is
* compute the test statistic theta(y1, y2), e.g. mean(y1)-mean(y2)
* repeat R times (e.g. R = 9999 or 999):
  draw a sample of size n1 from the pooled data, call that Y1, call the rest Y2
  compute theta(Y1, Y2)
* P-value for a one-sided test is (1 + k) / (1 + R)
  where k is the number of permutation samples with theta(Y1,Y2) >= theta(y1,y2)

The test statistic could be
  mean(y1) - mean(y2)
  mean(y1)
  sum(y1)
  t-statistic (pooled variance)
  P-value for a t-test (pooled variance)
  mean(y1) - mean(pooled data)
  t-statistic (unpooled variance)
  P-value for a t-test (unpooled variance)
  median(y1) - median(y2)
  ...
The first six of those are equivalent - they give exactly the same P-value
for the permutation test.  The reason is that those test statistics
are monotone transformations of each other, given the data.
Hence, doing the pooled-variance t calculations gains nothing.

Now consider your approach (2).  That is equivalent to the permutation
test using the test statistic:  mean(y1) - mean(pooled data).

Why not a bootstrap?  E.g. pool the data and draw samples of size
n1 and n2 from the pooled data, independently and with replacement.
That is similar to the permutation test, but less accurate.  Probably
the easiest way to see this is to suppose there is 1 outlier in the pooled data.
In any permutation iteration there is exactly 1 outlier among the two samples.
With bootstrapping, there could be 0, 1, 2, ...
The permutation test answers the question: given that there is exactly
1 outlier in my combined data, what is the probability that random chance
would give a difference as large as I observed?  The bootstrap would
answer some other question.
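The basic permutation test described above can be sketched in a few lines of R; the function name, B, and the choice of mean(y1) - mean(y2) as the statistic are illustrative:

```r
# Sketch of approach (1): two-sample permutation test.
# y1, y2: the two samples; B: number of random permutations.
perm_test <- function(y1, y2, B = 9999) {
  pooled <- c(y1, y2)
  n1 <- length(y1)
  observed <- mean(y1) - mean(y2)
  theta <- replicate(B, {
    idx <- sample(length(pooled), n1)   # permutation: draw WITHOUT replacement
    mean(pooled[idx]) - mean(pooled[-idx])
  })
  (1 + sum(theta >= observed)) / (1 + B)  # one-sided P-value
}
```

By the equivalence argument above, replacing the statistic with mean(y1), sum(y1), or the pooled-variance t statistic would give exactly the same P-value.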

Tim Hesterberg
NEW!  Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
http://home.comcast.net/~timhesterberg
 (resampling, water bottle rockets, computers to Guatemala, shower = 2650 light 
bulbs, ...)


On Oct 8, 2011, at 16:04 , francy wrote:

 Hi,

 I am having trouble understanding how to approach a simulation:

 I have a sample of n=250 from a population of N=2,000 individuals, and I
 would like to use either permutation test or bootstrap to test whether this
 particular sample is significantly different from the values of any other
 random samples of the same population. I thought I needed to take random
 samples (but I am not sure how many simulations I need to do) of n=250 from
 the N=2,000 population and maybe do a one-sample t-test to compare the mean
 score of all the simulated samples, + the one sample I am trying to prove
 that is different from any others, to the mean value of the population. But
 I don't know:
 (1) whether this one-sample t-test would be the right way to do it, and how
 to go about doing this in R
 (2) whether a permutation test or bootstrap methods are more appropriate

 This is the data frame that I have, which is to be sampled:
 df <-
 i.e.
 x y
 1 2
 3 4
 5 6
 7 8
 . .
 . .
 . .
 2,000

 I have this sample from df, and would like to test whether it has extreme
 values of y.
 sample1 <-
 i.e.
 x y
 3 4
 7 8
 . .
 . .
 . .
 250

 For now I only have this:

 R=999 #Number of simulations, but I don't know how many...
 t.values =numeric(R)  #creates a numeric vector with 999 elements, which
 will hold the results of each simulation.
 for (i in 1:R) {
 sample1 <- df[sample(nrow(df), 250, replace=TRUE),]

 But I don't know how to continue the loop: do I calculate the mean for each
 simulation and compare it to the population mean?
 Any help you could give me would be very appreciated,
 Thank you.

The straightforward way would be a permutation test, something like this

msamp <- mean(sample1$y)
mpop <- mean(df$y)
msim <- replicate(10000, mean(sample(df$y, 250)))

sum(abs(msim - mpop) >= abs(msamp - mpop)) / 10000

I don't really see a reason to do bootstrapping here. You say you want to test 
whether your sample could be a random sample from the population, so just 
simulate that sampling.

Re: [R] Extracting components from a 'boot' class output in R

2011-07-24 Thread Tim Hesterberg
Sarath pointed out (privately) that the standard error is not actually
stored as part of the bootstrap object.

Instead, it is calculated on the fly when you print the bootstrap object;
this is done by
  boot:::print.boot
Similarly for bias.
(The 'boot:::' is necessary because 'print.boot' is not an exported
object from 'namespace:boot'.)
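The same quantities can be recomputed directly from the stored components: bootObj$t holds the R bootstrap statistics (one column per statistic) and bootObj$t0 the estimates on the original data. A self-contained sketch with a toy statistic, since the poster's `DAT` and `Lp.est` are not available:

```r
# Recompute the "std. error" and "bias" columns that print.boot shows.
library(boot)
set.seed(1)
bootObj <- boot(data = rnorm(30),
                statistic = function(d, i) mean(d[i]), R = 500)
boot_se   <- apply(bootObj$t, 2, sd)           # the "std. error" column
boot_bias <- colMeans(bootObj$t) - bootObj$t0  # the "bias" column
```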

Tim Hesterberg

Do
  names(bootObj)
to find out what the components are, and use $ or [[ to extract
components.
Do
  help(boot)
for a description of components of the object (look in the Value section).

That is general advice in R, applying to all kinds of objects -
boot, and many other functions such as lm(), return lists with
a class added, and you can operate on the object as a list using
names(), $, etc.

Tim Hesterberg

Dear R user,

I used the following to do a bootstrap.


bootObj <- boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

I have the following output from the above bootstrap. How
can I extract components of the output?
For example, how can I extract the std.error?


 bootObj

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

Bootstrap Statistics :
  original    bias  std. error
t1*  794.9745 -0.341    4.042099

Any help is greatly appreciated.

Thank you


Sarath Banneheka



Re: [R] Extracting components from a 'boot' class output in R

2011-07-23 Thread Tim Hesterberg
Do
  names(bootObj)
to find out what the components are, and use $ or [[ to extract
components.
Do
  help(boot)
for a description of components of the object (look in the Value section).

That is general advice in R, applying to all kinds of objects -
boot, and many other functions such as lm(), return lists with
a class added, and you can operate on the object as a list using
names(), $, etc.

Tim Hesterberg

Dear R user,

I used the following to do a bootstrap.


bootObj <- boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

I have the following output from the above bootstrap. How
can I extract components of the output?
For example, how can I extract the std.error?


 bootObj

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

Bootstrap Statistics :
  original    bias  std. error
t1*  794.9745 -0.341    4.042099

Any help is greatly appreciated.

Thank you


Sarath Banneheka



Re: [R] Question on approximations of full logistic regression model

2011-05-16 Thread Tim Hesterberg
My usual rule is that whatever gives the widest confidence intervals
in a particular problem is most accurate for that problem :-)

Bootstrap percentile intervals tend to be too narrow.
Consider the case of the sample mean; the usual formula CI is
xbar +- t_alpha sqrt( (1/(n-1)) sum((x_i - xbar)^2)) / sqrt(n)
The bootstrap percentile interval for symmetric data is roughly
xbar +- z_alpha sqrt( (1/(n  )) sum((x_i - xbar)^2)) / sqrt(n)
It is narrower than the formula CI because
  * z quantiles rather than t quantiles
  * standard error uses divisor of n rather than (n-1)

In stratified sampling, the narrowness factor depends on the
stratum sizes, not the overall n.
In regression, estimates for some quantities may be based on a small
subset of the data (e.g. coefficients related to rare factor levels).

This doesn't mean we should give up on the bootstrap.
There are remedies for the bootstrap biases, see e.g.
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling
vs. Smoothing, Proceedings of the Section on Statistics and the
Environment, American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf

And other methods have their own biases, particularly in nonlinear
applications such as logistic regression.
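A quick numeric illustration of that narrowness, for the one-sample mean (the data are made up; the two effects at work are the z-vs-t quantiles and the n-vs-(n-1) divisor described above):

```r
# Compare the usual t interval with the bootstrap percentile interval.
set.seed(1)
x <- rnorm(10); n <- length(x)
t_ci   <- mean(x) + qt(c(0.025, 0.975), n - 1) * sd(x) / sqrt(n)
means  <- replicate(10000, mean(sample(x, replace = TRUE)))
perc_ci <- unname(quantile(means, c(0.025, 0.975)))
diff(perc_ci) < diff(t_ci)  # percentile interval is typically narrower
```

With n = 10 the expected width ratio is roughly (1.96/2.26) * sqrt(9/10), i.e. around 0.82, matching the two factors listed above.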

Tim Hesterberg

Thank you for your reply, Prof. Harrell.

I agree with you. Dropping only one variable does not actually help a lot.

I have one more question.
During analysis of this model I found that the confidence
intervals (CIs) of some coefficients provided by bootstrapping (the bootcov
function in the rms package) were narrower than CIs provided by the usual
variance-covariance matrix, and CIs of other coefficients wider.  My data
has no cluster structure. I am wondering which CIs are better.
I guess the bootstrap ones, but is that right?

I would appreciate your help in advance.
--
KH



(11/05/16 12:25), Frank Harrell wrote:
 I think you are doing this correctly except for one thing.  The validation
 and other inferential calculations should be done on the full model.  Use
 the approximate model to get a simpler nomogram but not to get standard
 errors.  With only dropping one variable you might consider just running the
 nomogram on the entire model.
 Frank


 KH wrote:

 Hi,
  I am trying to construct a logistic regression model from my data (104
  patients and 25 events). I built a full model consisting of five
  predictors with the use of penalization by the rms package (lrm, pentrace,
  etc.) because of the events-per-variable issue. Then, I tried to approximate
  the full model by a step-down technique, predicting L from all of the
  component variables using ordinary least squares (ols in rms package) as
  follows. I would like to know whether I am doing this right or not.

 library(rms)
 plogit- predict(full.model)
 full.ols- ols(plogit ~ stenosis+x1+x2+ClinicalScore+procedure, sigma=1)
 fastbw(full.ols, aics=1e10)

   Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC    R2
   stenosis        1.41  1    0.2354   1.41   1    0.2354  -0.59 0.991
   x2             16.78  1    0.0000  18.19   2    0.0001  14.19 0.882
   procedure      26.12  1    0.0000  44.31   3    0.0000  38.31 0.711
   ClinicalScore  25.75  1    0.0000  70.06   4    0.0000  62.06 0.544
   x1             83.42  1    0.0000 153.49   5    0.0000 143.49 0.000

  Then, I fitted an approximation to the full model using the most important
  variables (stopping before R^2 for predictions from the reduced model
  against the original Y drops below 0.95), that is, dropping stenosis.

 full.ols.approx- ols(plogit ~ x1+x2+ClinicalScore+procedure)
 full.ols.approx$stats
n  Model L.R.d.f.  R2   g   Sigma
 104.000 487.9006640   4.000   0.9908257   1.3341718   0.1192622

 This approximate model had R^2 against the full model of 0.99.
 Therefore, I updated the original full logistic model dropping
 stenosis as predictor.

 full.approx.lrm- update(full.model, ~ . -stenosis)

 validate(full.model, bw=F, B=1000)
            index.orig training   test  optimism index.corrected     n
  Dxy           0.6425   0.7017 0.6131    0.0887          0.5539  1000
  R2            0.3270   0.3716 0.3335    0.0382          0.2888  1000
  Intercept     0.0000   0.0000 0.0821   -0.0821          0.0821  1000
  Slope         1.0000   1.0000 1.0548   -0.0548          1.0548  1000
  Emax          0.0000   0.0000 0.0263    0.0263          0.0263  1000

 validate(full.approx.lrm, bw=F, B=1000)
            index.orig training   test  optimism index.corrected     n
  Dxy           0.6446   0.6891 0.6265    0.0626          0.5820  1000
  R2            0.3245   0.3592 0.3428    0.0164          0.3081  1000
  Intercept     0.0000   0.0000 0.1281   -0.1281          0.1281  1000
  Slope         1.0000   1.0000 1.1104   -0.1104          1.1104  1000
  Emax          0.0000   0.0000 0.0444    0.0444          0.0444  1000

 Validation revealed this approximation was not bad.
 Then, I made a nomogram.

 full.approx.lrm.nom <- nomogram

Re: [R] memory and bootstrapping

2011-05-05 Thread Tim Hesterberg
(Regarding bootstrapping logistic regression.)

If the number of rows with Y=1 is small, it doesn't matter that n is huge.

If both the number of successes and of failures is huge, then as Ripley notes
you can use asymptotic CIs.  The mean difference in predicted probabilities
is a nonlinear function of the coefficients, so you can use the delta method
to get standard errors.

In general, if you're not sure about normality and bias, you can use
the bootstrap to estimate how close to normal the sampling distribution
is.  The results may surprise you.  For example, for a one-sample mean,
if the population has skewness = 2 (like an exponential distn),
you need n=5000 before the CLT is reasonably accurate (actual 
one-sided non-coverage probabilities within 10% of the nominal, for a
95% interval).

Finally, you can speed up bootstrapping glms by using starting
values based on the coefficients estimated from the original data.
And, you can compute the model matrix once and resample rows of that
along with y, rather than computing a model matrix from scratch each time.
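Both speed-ups can be sketched as follows; the data and model are made up for illustration, with `glm.fit` taking the precomputed model matrix directly and `start` supplying the warm start:

```r
# Sketch: faster glm bootstrapping via a precomputed model matrix
# and starting values from the original fit.
set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))
fit0 <- glm(y ~ x1 + x2, family = binomial, data = d)
X <- model.matrix(fit0)        # model matrix computed once
oneBoot <- function() {
  i <- sample(n, replace = TRUE)                  # resample rows of X and y
  fit <- glm.fit(X[i, ], d$y[i], family = binomial(),
                 start = coef(fit0))              # warm start
  fit$coefficients
}
bootCoefs <- t(replicate(200, oneBoot()))
```

This skips the per-iteration formula parsing and model-matrix construction that glm() would repeat, and the warm start typically reduces the number of IRLS iterations.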

Tim Hesterberg

The only reason the boot package will take more memory for 2000
replications than 10 is that it needs to store the results.  That is
not to say that on a 32-bit OS the fragmentation will not get worse,
but that is unlikely to be a significant factor.

As for the methodology: 'boot' is support software for a book, so
please consult it (and not secondary sources).  From your brief
description it looks to me as if you should be using studentized CIs.

130,000 cases is a lot, and running the experiment on a 1% sample
may well show that asymptotic CIs are good enough.

On Thu, 5 May 2011, E Hofstadler wrote:

 hello,

 the following questions will without doubt reveal some fundamental
 ignorance, but hopefully you can still help me out.

 I'd like to bootstrap a coefficient gained on the basis of the
 coefficients in a logistic regression model (the mean differences in
 the predicted probabilities between two groups, where each predict()
 operation uses as the newdata-argument a dataframe of equal size as
 the original dataframe). I've got 130,000 rows and 7 columns in my
 dataframe. The glm-model uses all variables (as well as two 2-way
 interactions).

 System:
 - R-version: 2.12.2
 - OS: Windows XP Pro, 32-bit
 - 3.16Ghz intel dual core processor, 2.9GB RAM

 I'm using the boot package to arrive at the standard errors for this
 difference, but even with only 10 replications, this takes quite a
 long time: 216 seconds (perhaps this is partly also due to my
 inefficiently programmed function underlying the boot-call, I'm also
 looking into that).

 I wanted to try out calculating a bca-bootstrapped confidence
 interval, which as I understand requires a lot more replications than
 normal-theory intervals. Drawing on John Fox's appendix to his "An R
 Companion to Applied Regression", I was thinking of trying out 2000
 replications -- but this will take several hours to compute on my
 system (which isn't in itself a major issue though).

 My Questions:
 - let's say I try bootstrapping with 2000 replications. Can I be
 certain that the memory available to R  will be sufficient for this
 operation?
 - (this relates to statistics more generally): is it a good idea in
 your opinion to try bca-bootstrapping, or can it be assumed that a
 normal theory confidence interval will be a sufficiently good
 approximation (letting me get away with, say, 500 replications)?


 Best,
 Esther

--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595





Re: [R] Bootstrap 95% confidence intervals for splines

2011-03-27 Thread Tim Hesterberg
You're mixing up two concepts here,
  - splines
  - bootstrap confidence intervals
Separating them may help cut the confusion.

First, to do a bootstrap confidence interval for a difference in predictions
in the linear regression case, do:

repeat 10^4 times
  draw a bootstrap sample of the observations (subjects, keeping x and y together)
  fit the linear regression to the bootstrap sample
  record the difference in predictions at the two x values
end loop
The bootstrap confidence interval is the range of the middle 95% of
the recorded differences.

For a spline, the procedure is the same except for fitting a spline regression:

repeat 10^4 times
  draw a bootstrap sample of the observations (subjects, keeping x and y together)
  fit the SPLINE regression to the bootstrap sample
  record the difference in predictions at the two x values
end loop
The bootstrap confidence interval is the range of the middle 95% of
the recorded differences.

Tim Hesterberg

P.S. I think you're mixing up the response and explanatory variables.
I'd think of eating hot dogs as the cause (explanatory variable),
and waistline as the effect (response, or outcome).

P.P.S.  I don't like the terms independent and dependent variables,
as that conflicts with the concept of independence in probability.
Independent variables are generally not independent, and the dependent
variable may be independent of the others.

There appear to be reports in the literature that transform continuous
independent variables by the use of splines, e.g., assume the dependent
variable is hot dogs eaten per week (HD) and the independent variable is
waistline (WL); a normal linear regression model would be:

nonconfusing_regression <- lm(HD ~ WL)

One might use a spline,

confusion_inducing_regression_with_spline <- lm(HD ~ ns(WL, df = 4))

Now is where the problem starts.

From nonconfusing_regression, I get, say, 2 added hot dogs per week for each
centimeter of waistline, along with an s.e. of 0.5 hot dogs per week, which I
multiply by 1.96 to garner each side of the 95% c.i.
If I want to show what the difference between the 75th percentile (say 100
cm) and 25th percentile (say 80 cm) waistlines are, I multiply 2 by
100-80=20 and get 40 hot dogs per week as the point estimate with a similar
bumping of the s.e. to 10 hot dogs per week.

What do I do to get the point estimate and 95% confidence interval for the
difference between 100 cm persons and 80 cm persons with
confusion_inducing_regression_with_spline?

Best regards.

Mitchell S. Wachtel, MD



Re: [R] jackknife-after-bootstrap

2010-11-15 Thread Tim Hesterberg
Can someone help me about detection of outliers using jackknife after
bootstrap algorithm?

A simple procedure is to calculate the mean of the bootstrap
statistics for all bootstrap samples that omit the first of the
original observations.  Repeat for the second, third, ... original
observation.  You now have $n$ means, and can look at these for
outliers.

A similar approach is to calculate means of bootstrap statistics
for samples that include (rather than omit) each of the original 
observations.

Both of those approaches can suffer from considerable random variability.
Provided the number of bootstrap samples is large, a better approach
is to use linear regression, where
y = the vector of bootstrap statistics, length R
X = R by n matrix, with X[i, j] = the number of times
original observation j is included in bootstrap sample i
and without an intercept.  The $n$ regression coefficients give estimates
of the influence of the original observations, and you can look for outliers
in these influence estimates.

For comparison, the first simple procedure above corresponds to
taking averages of y for rows with X[, j] == 0, and the similar approach
to averaging y for rows with X[, j] > 0.
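A sketch of the regression approach in R, with hypothetical data (the statistic here is the mean, and one deliberately outlying observation is appended):

```r
set.seed(1)
x <- c(rexp(29), 10)        # hypothetical sample; obs 30 is an outlier
n <- length(x); R <- 2000
X <- matrix(0L, R, n)       # X[i, j] = times obs j appears in sample i
y <- numeric(R)             # bootstrap statistics
for (i in 1:R) {
  idx <- sample(n, replace = TRUE)
  X[i, ] <- tabulate(idx, nbins = n)
  y[i] <- mean(x[idx])
}
infl <- coef(lm(y ~ X - 1))  # n influence estimates, no intercept
# observations with the most extreme estimated influence:
order(abs(infl - mean(infl)), decreasing = TRUE)[1:3]
```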

For further discussion see
Hesterberg, Tim C. (1995), Tail-Specific Linear Approximations for Efficient 
Bootstrap Simulations, Journal of Computational and Graphical Statistics, 
4(2), 113-133.

Hesterberg, Tim C. and Stephen J. Ellis (1999), Linear Approximations for 
Functional Statistics in Large-Sample Applications, Technical Report No. 86, 
Research Department, MathSoft, Inc., 1700 Westlake Ave. N., Suite 500, Seattle, 
WA 98109.
http://home.comcast.net/~timhesterberg/articles/tech86-linear.pdf



Re: [R] How to do bootstrap for the complex sample design?

2010-11-04 Thread Tim Hesterberg
Faye wrote:
Our survey is structured as : To be investigated area is divided into
6 regions, within each region, one urban community and one rural
community are randomly selected, then samples are randomly drawn from
each selected uran and rural community.

The problems is that in urban/rural stratum, we only have one sample.
In this case, how to do bootstrap?

You are lucky that your sample size is 1.  If it were 2 you would
probably have proceeded without realizing that the answers were wrong.

Suppose you had two samples in each stratum.  If you proceed naturally,
drawing bootstrap samples of size 2 from each stratum, this would
underestimate variability by a factor of 2.

In general the ordinary nonparametric bootstrap estimates of variability
are biased downward by a factor of (n-1)/n -- exactly for the mean, 
approximately for other statistics.  In multiple-sample and stratified
situations, the bias depends on the stratum sizes.

Three remedies are:
* draw bootstrap samples of size n-1
* bootknife sampling - omit one observation (a jackknife sample), then
  draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal
  to empirical covariance (with divisor n-1) / n.
The latter two are described in 
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling vs. 
Smoothing, Proceedings of the Section on Statistics and the Environment, 
American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf

All three are undefined for samples of size 1.  You need to go to some
other bootstrap, e.g. a parametric bootstrap with variability estimated
from other data.
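A minimal sketch of bootknife sampling (the second remedy) for a single sample, with invented data; the statistic is the mean, so the result can be compared with the usual formula standard error:

```r
bootknife_sample <- function(x) {
  n <- length(x)
  jack <- x[-sample(n, 1)]                 # omit one observation at random
  sample(jack, size = n, replace = TRUE)   # bootstrap sample of size n
}

set.seed(1)
x <- rnorm(20)
se_bootknife <- sd(replicate(5000, mean(bootknife_sample(x))))
se_bootknife  # compare with sd(x) / sqrt(20)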

Tim Hesterberg



Re: [R] first post and bootstarpping problems

2010-10-08 Thread Tim Hesterberg
I use R for a year now and am dealing with geometric morphometrics of deer 
skulls. Yes, I am a biologist and my math skills are just beginning to brush 
up. To cut to the chase...

I have two groups in my data (males and females) and my data is in a simple 
vector form. Now I need a bootstrap test for this value

szc1 <- ((mean(maleCent) - mean(femaCent))^2) / (var(maleCent) + var(femaCent))

which concerns two aforementioned groups. I have 39 males and 11 females 
totaling to 50 individuals. Now I don`t know how to assign this to a bootstrap 
boot() function. Any ideas?

You should use a two-sample permutation test rather than doing a bootstrap.

The basic idea is:
compute the statistic for the original data
compute the statistic for many permutations of the data
compute upper and lower P-values using
  (1 + number of permutation statistics at least as extreme, in the given
   direction, as the original) / (r+1)
two-sided P-value = 2 * smaller of one-sided P-values

Here is example code:

maleCent <- rnorm(39) # example data
femaCent <- rnorm(11)
xBoth <- c(maleCent, femaCent)
statistic <- function(x){
  x1 <- x[1:39]
  x2 <- x[40:50]
  (mean(x1) - mean(x2))^2 / (var(x1) + var(x2))
}
theta <- statistic(xBoth)

r <- 10^5 - 1
permutationDistribution <- numeric(r)
for(i in 1:r) {
  # a permutation: sample all 50 values without replacement
  permutationDistribution[i] <- statistic(sample(xBoth, size = 50))
}

pValueLower <- (1 + sum(permutationDistribution <= theta)) / (r + 1)
pValueUpper <- (1 + sum(permutationDistribution >= theta)) / (r + 1)
pValueTwosided <- 2 * min(pValueLower, pValueUpper)

Tim Hesterberg



[R] make methods work in lapply - remove lapply's environment

2008-09-08 Thread Tim Hesterberg
I've defined my own version of summary.default,
that gives a better summary for highly skewed vectors.

If I call
  summary(x)
the method is used.

If I call
  summary(data.frame(x))
the method is not used.

I've traced this to lapply; this uses the new method:
  lapply(list(x), function(x) summary(x))
and this does not:
  lapply(list(x), summary)

If I make a copy of lapply, WITHOUT the environment,
then the method is used.

lapply <- function (X, FUN, ...) {
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}

I'm curious to hear reactions to this.
There is a March 2006 thread
object size vs. file size
in which Duncan Murdoch wrote:
 Functions in R consist of 3 parts: the formals, the body, and the
 environment. You can't remove any part, but you can change it.
That is exactly what I want to do, remove the environment, so that
when I define a better version of some function that the better
version is used.

Here's a function to automate the process:
copyFunction <- function(Name){
  # Copy a function, without its environment.
  # Name should be quoted.
  # Return the copy.
  file <- tempfile()
  on.exit(unlink(file))
  dput(get(Name), file = file)
  f <- source(file)$value
  f
}
lapply <- copyFunction("lapply")




Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

2008-07-27 Thread Tim Hesterberg
I'll address the question of whether you can use the bootstrap to
improve estimates, and whether you can use the bootstrap to virtually
increase the size of the sample.

Short answer - no, with some exceptions (bumping / Random Forests).

Longer answer:
Suppose you have data (x1, ..., xn) and a statistic ThetaHat,
that you take a number of bootstrap samples (all of size n) and
let ThetaHatBar be the average of those bootstrap statistics from
those samples.

Is ThetaHatBar better than ThetaHat?  Usually not.  Usually it
is worse.  You have not collected any new data, you are just using the
existing data in a different way, that is usually harmful:
* If the statistic is the sample mean, all this does is to add
  some noise to the estimate
* If the statistic is nonlinear, this gives an estimate that
  has roughly double the bias, without improving the variance.

What are the exceptions?  The prime example is tree models (random
forests) - taking bootstrap averages helps smooth out the
discontinuities in tree models.  For a simple example, suppose that a
simple linear regression model really holds:
y = beta x + epsilon
but that you fit a tree model; the tree model predictions are
a step function.  If you bootstrap the data, the boundaries of
the step function will differ from one sample to another, so
the average of the bootstrap samples smears out the steps, getting
closer to the smooth linear relationship.
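A sketch of that smoothing effect, assuming the recommended rpart package for the tree fits (the data are invented, with a truly linear relationship):

```r
library(rpart)
set.seed(1)
n <- 200
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.3)         # truly linear relationship
grid <- data.frame(x = seq(0, 1, length = 101))

one_tree <- predict(rpart(y ~ x, d), grid)  # a step function
bagged <- rowMeans(replicate(100, {
  b <- d[sample(n, replace = TRUE), ]       # bootstrap the data
  predict(rpart(y ~ x, b), grid)            # steps land in different places
}))
# averaging smears out the steps, typically tracking 2*x more closely:
c(tree   = mean((one_tree - 2 * grid$x)^2),
  bagged = mean((bagged   - 2 * grid$x)^2))
```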

Aside from such exceptions, the bootstrap is used for inference
(bias, standard error, confidence intervals), not improving on
ThetaHat.

Tim Hesterberg

Hi Doran,

Maybe I am wrong, but I think bootstrap is a general resampling method which
can be used for different purposes...Usually it works well when you do not
have a representative sample set (maybe with limited number of samples).
Therefore, I am positive with Michal...

P.S., overfitting, in my opinion, is used to depict when you got a model
which is quite specific for the training dataset but cannot be generalized
with new samples..

Thanks,

--Jerry
2008/7/21 Doran, Harold [EMAIL PROTECTED]:

  I used bootstrap to virtually increase the size of my
  dataset, it should result in estimates more close to that
  from the population - isn't it the purpose of bootstrap?

 No, not really. The bootstrap is a resampling method for variance
 estimation. It is often used when there is not an easy way, or a closed
 form expression, for estimating the sampling variance of a statistic.



Re: [R] sample size in bootstrap(boot)

2008-06-08 Thread Tim Hesterberg
bootstrap() and samp.bootstrap() are part of the S+Resample package,
see http://www.insightful.com/downloads/libraries

You could modify boot() to allow sampling with size other than n.

Use caution when bootstrapping with a sample size other than n.
The usual reason for bootstrapping is inference (standard errors,
confidence intervals) using the actual data, including the actual
sample size, not some other data that you don't have.

However, there are reasons to sample with other sample sizes, e.g.:
* Planning for future work, e.g. planning for a clinical trial with
  large n based on current sample data with small n.  You may want to
  try different n, to see how that would affect standard errors or
  normality of sampling distributions.
* Better accuracy.  Bootstrap standard errors are biased downward,
  corresponding to computing the usual sample standard deviation using
  a divisor of n instead of (n-1).  Bootstrap distributions tend to
  be too narrow.  One remedy is to sample with size (n-1).  For others
  see:
Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling
vs. Smoothing, Proceedings of the Section on Statistics and the
Environment, American Statistical Association, 2924-2930.
http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
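The downward bias and the size n-1 remedy can be checked directly for the mean, where the true bootstrap standard error is known in closed form (this simulation is illustrative, not from the paper above):

```r
set.seed(1)
x <- rnorm(15); n <- length(x)
boot_se <- function(size)
  sd(replicate(10^4, mean(sample(x, size, replace = TRUE))))
ses <- c(formula  = sd(x) / sqrt(n),  # usual (n-1)-divisor standard error
         size_n   = boot_se(n),       # biased downward, factor sqrt((n-1)/n)
         size_nm1 = boot_se(n - 1))   # approximately unbiased
ses
```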

Tim Hesterberg
(formerly of Insightful, now Google, and only now catching up on R-help)

   Hi Dan,

   Thanks  for response yes i do know that bootstrap samples generated by
   function boot are of the same size as original dataset but somewhere in the
   R-help threads i saw a suggestion that one can control sample size (n) by
   using the following command(plz see below) but my problem is it doesnt work
   it gives error ( error in : n * nboot : non-numeric argument to binary
   operator)

   bootstrap(data,statistic,sampler=samp.bootstrap(size=20))

this is what somebody on R help suggested... can we fix that error somehow
   ?

   On Wed, 26 Mar 2008 08:26:22 -0700 Nordlund, Dan (DSHS/RDA) wrote:
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Zaihra T
 Sent: Wednesday, March 26, 2008 7:57 AM
 To: Jan T. Kim; R-help@r-project.org
  Subject: [R] sample size in bootstrap(boot)


 Hi,

 Can someone tell me how to control sample size (n) in
 bootstrap function
 boot in R. Can we give some option like we give for #
 of repeated
 samples(R=say 100).

 Will appreciate any help.

 thanks
   
I don't believe so. Isn't one of the differences between the bootstrap and
   other kinds of
resampling that the bootstrap samples with replacement a sample of the
   same size as the
original data? You could use the function sample() to select your subsets
   and compute your
statistics of interest.
   
Hope this is helpful,
   
Dan
   
Daniel J. Nordlund
Research and Data Analysis
 Washington State Department of Social and Health Services
Olympia, WA 98504-5204



Re: [R] applying a function to data frame columns

2008-02-22 Thread Tim Hesterberg
You can do:

lapply2(u, v, function(u,v) u[inRange(u, range(v))])

using two functions 'lapply2' and 'inRange' defined at bottom.
This basically does:

lapply(seq(along = u),
       function(i, U, V){
         u <- U[[i]]
         v <- V[[i]]
         u[u >= range(v)[1] & u <= range(v)[2]]
       },
       U = u, V = v)

Tim Hesterberg

I want to apply this function to the columns of a data frame:

u[u >= range(v)[1] & u <= range(v)[2]]

where u is the n column data frame under consideration and v is a data frame
of values with the same number of columns as u.  For example,
v1 <- c(1,2,3)
v2 <- c(3,4,5)
v3 <- c(2,3,4)
v <- as.data.frame(cbind(v1,v2,v3))

uk1 <- seq(min(v1) - .5, max(v1) + .5, .5)
uk2 <- seq(min(v2) - .5, max(v2) + .5, .5)
uk3 <- seq(min(v3) - .5, max(v3) + .5, .5)

u <- do.call(expand.grid, list(uk1,uk2,uk3))

Here, there are 3 columns; instead of hard-coding this, can the function
given above, which will restrict the u data frame to values within the
ranges of each variable, be done with the apply function?  Thanks in
advance.

dxc13

# inRange requires ifelse1, part of the splus2R package.

inRange <- function(x, a, b, strict = FALSE) {
  # Return TRUE where x is within the range of a to b.
  # If a is length 2 and b is missing, assume that a gives the range.
  # If strict == FALSE, then allow equality; otherwise require a < x < b.
  # strict may be a vector of length 2, governing the two ends.
  if(length(a) == 2) {
    b <- a[2]
    a <- a[1]
  }
  else if(length(a) * length(b) != 1)
    stop("a and b must both have length 1, or a may have length 2")
  strict <- rep(strict, length = 2)
  ifelse1(strict[1], x > a, x >= a) & ifelse1(strict[2], x < b, x <= b)
}

lapply2 <- function(X1, X2, FUN, ...){
  # Like lapply, but for two inputs.
  # FUN should take two inputs, one from X1 and one from X2.

  n1 <- length(X1)
  if(n1 != length(X2))
    stop("X1 and X2 have different lengths")

  if(is.character(FUN))
    FUN <- getFunction(FUN)
  else if(!is.function(FUN)) {
    farg <- substitute(FUN)
    if(is.name(farg))
      FUN <- getFunction(farg)
    else
      stop("'", deparseText(farg), "' is not a function")
  }

  FUNi <- function(i, X1, X2, FUN2, ...)
    FUN2(X1[[i]], X2[[i]], ...)

  # Create sequence vector.
  # If objects have same names, use them.
  i <- seq(length = n1)
  names1 <- names(X1)
  if(length(names1) && identical(names1, names(X2)))
    names(i) <- names1

  # Final result; loop over the sequence vector
  lapply(i, FUNi, X1 = X1, X2 = X2, FUN2 = FUN, ...)
}



Re: [R] Does the t.test in R uses Welch procedure or ordinary student t-test?

2008-02-19 Thread Tim Hesterberg
First, a clarification.  The subject line suggests that the
Welch procedure is not an ordinary student t-test.
That is incorrect.

There are two common two-sample t-tests:
non-pooled variance (Welch version)
pooled variance

I would refer to the non-pooled version just as a two-sample t-test;
if you want to be definite, you could refer to Welch or non-pooled.

The other common t-test is the pooled-variance t-test.  In the old
days, the pooled-variance t-test was considered standard, but
statistics has shifted in favor of the non-pooled test being the
standard; someone using the pooled-variance version should note the
choice, and justify the choice (at least to him or her-self).

Also note that an F-test is often a poor way to justify pooling,
because the F-test is not robust against non-normality.  "To make a
preliminary test on variances is rather like putting to sea in a
rowing boat to find out whether conditions are sufficiently calm for
an ocean liner to leave port."  (G.E.P. Box, "Non-normality and tests
on variances," Biometrika, 40 (1953), pp 318-335, quote on page 333;
via Moore & McCabe.)
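In R, t.test() gives the Welch (non-pooled) test by default; the pooled-variance version is requested with var.equal = TRUE:

```r
# Welch (non-pooled), the default: fractional df (17.776 for these data)
t.test(extra ~ group, data = sleep)

# pooled-variance version: df = n1 + n2 - 2 = 18
t.test(extra ~ group, data = sleep, var.equal = TRUE)
```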

Tim Hesterberg

Dear all

I have run t.test(), and get a output similar to this:

t.test(extra ~ group, data = sleep)

Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

Should this be refered as a Welch procedure or ordinary student t-test?

Regards Kes



[R] use rowSums or colSums instead of apply!

2008-02-19 Thread Tim Hesterberg
There were two queries recently regarding removing
rows or columns that have all NAs.

Three respondents suggested combinations of apply() with
any() or all().

I cringe when I see apply() used unnecessarily.
Using rowSums() or colSums() is much faster, and gives more readable
code.  (Two respondents did suggest colSums for the second query.)

# original small data frame
df <- data.frame(col1 = c(1:3, NA, NA, 4), col2 = c(7:9, NA, NA, NA), col3 = c(2:4, NA, NA, 4))
system.time( for(i in 1:10^4) temp <- rowSums(is.na(df)) < 3 )
# .078
system.time( for(i in 1:10^4) temp <- apply(df, 1, function(x) any(!is.na(x))) )
# 3.33

# larger data frame
x <- matrix(runif(10^5), 10^3)
x[ runif(10^5) > .99 ] <- NA
df2 <- data.frame(x)
system.time( for(i in 1:100) temp <- rowSums(is.na(df2)) < 100 )
# .34
system.time( for(i in 1:100) temp <- apply(df2, 1, function(x) any(!is.na(x))) )
# 3.34

Tim Hesterberg



Re: [R] Sampling

2008-02-07 Thread Tim Hesterberg
Thomas Lumley wrote:
On Wed, 6 Feb 2008, Tim Hesterberg wrote:

 Tim Hesterberg wrote:
 I'll raise a related issue - sampling with unequal probabilities,
 without replacement.  R does the wrong thing, in my opinion:
 ...
 Peter Dalgaard wrote:
 But is that the right thing? ...
 (See bottom for more of the previous messages.)


 First, consider the common case, where size * max(prob) < 1 --
 sampling with unequal probabilities without replacement.

 Why do people do sampling with unequal probabilities, without
 replacement?  A typical application would be sampling with probability
 proportional to size, or more generally where the desire is that
 selection probabilities match some criterion.

In real survey PPS sampling it also matters what the pairwise joint 
selection probabilities are -- and there are *many* algorithms, with 
different properties. Yves Till'e has written an R package that implements 
some of them, and the pps package implements others.

Brewer  Hanif (1983) Sampling with unequal probabilities, Springer
give 50 algorithms.

 The default S-PLUS algorithm does that.  The selection probabilities
 at each of step 1, 2, ..., size are all equal to prob, and the overall
 probabilities of selection are size*prob.

Umm, no, they aren't.

Splus 7.0.3 doesn't say explicitly what its algorithm is, but is happy to 
take a sample of size 10 from a population of size 10 with unequal 
sampling probabilities.  The overall selection probability *can't* be 
anything other than 1 for each element -- sampling without replacement and 
proportional to any other set of  probabilities is impossible.

The default S-PLUS algorithm when probabilities are supplied is
sampling with minimal replacement, not without replacement.  Minimal
replacement is not interpreted strictly - duplicates may occur
if (size*max(prob)1), so that selection probabilities are right
at every step:
> 10 * (1:10 / sum(1:10)) # expected counts
 [1] 0.1818182 0.3636364 0.5454545 0.7272727 0.9090909 1.0909091 1.2727273
 [8] 1.4545455 1.6363636 1.8181818
> sample(10, size=10, prob = 1:10) # default arguments give minimal replacement
 [1]  9  9  5 10  3 10  7  6  7  8

One correction to my statement - in the case that size * max(prob) > 1,
replace "overall probabilities of selection are size*prob"
with "expected number of times each observation is selected is size*prob".


In contrast, minimal=F gives the same algorithm as R, without replacement:
> sample(10, size=10, prob = 1:10, minimal = F)
 [1]  8 10  9  6  7  3  4  5  1  2
Note that the later draws tend to be observations with small prob,
because that is what is left over.

Here is a quick simulation to demonstrate selection probabilities:
> values <- sapply(1:10^3, function(i) sample(5, size=3, prob = 1:5))
> apply(values, 1, table) / 10^3

In S-PLUS, with minimal replacement
   [,1]  [,2]  [,3] 
1 0.068 0.060 0.065
2 0.145 0.140 0.137
3 0.181 0.214 0.197
4 0.274 0.243 0.276
5 0.332 0.343 0.325

In R:
   [,1]  [,2]  [,3]
1 0.051 0.082 0.117
2 0.140 0.159 0.211
3 0.208 0.209 0.230
4 0.269 0.261 0.230
5 0.332 0.289 0.212

Column k corresponds to the kth draw.  Note that in S+ all columns
match the specified probabilities prob = (1:5/15), 
while in R only the first column does.

Splus 7.0.3 doesn't say explicitly what its algorithm is, ...

I'll add it to the help file, and post to r-devel.

Even in a milder case -- samples of size 5 from 1:10 with probabilities 
proportional to 1:10 -- the deviation is noticeable in 1000 replications. 
In this case sampling with the specified probabilities is actually 
possible, but S-PLUS doesn't do it.

I think you're looking at the case of sampling without replacement,
not sampling with minimal replacement.

For sampling without replacement S-PLUS uses the same algorithm as R
-- at each step, draw an observation with probabilities proportional
to prob, omitting observations already drawn.

Now, it might be useful to add another replace=FALSE sampler to sample(), 
such as the newish Conditional Poisson Sampler based on the work of 
S.X.Chen.  This does give correct marginal probabilities of inclusion, and 
the pairwise joint probabilities are not too hard to compute.

That could be interesting.  The pairwise joint probabilities are
important in some applications.  Are all the pairwise joint
probabilities nonzero (when size*prob allows that)?

Pedro J. Saavedra  (2005)
Comparison of Two Weighting Schemes for Sampling with Minimal Replacement 
http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf
finds that the other scheme has some advantages over the one I used.

A historical note - Saavedra notes that Chromy (1979) had the concept
of sampling with minimal replacement under a different name.
I used the term in the S+Resample version released 8/26/2002.

Note that any algorithm for sampling without replacement can be
extended to sampling with minimal replacement, by first drawing
observations deterministically according to the integer part of
size*prob, then drawing the remainder without replacement with
probabilities based on the fractional parts.
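A sketch of that idea (a hypothetical helper, not the S-PLUS implementation; the leftover draw is approximated with R's sequential without-replacement sampler rather than an exact fractional-part scheme):

```r
sample_minimal <- function(n, size, prob) {
  # sampling with minimal replacement:
  # expected count of observation j is size * prob[j]
  prob <- prob / sum(prob)
  expected <- size * prob
  counts <- floor(expected)            # deterministic inclusions
  k <- size - sum(counts)              # draws remaining
  if (k > 0) {
    extra <- sample(n, k, prob = expected - counts)  # fractional parts,
    counts[extra] <- counts[extra] + 1               # without replacement
  }
  rep.int(seq_len(n), counts)
}

set.seed(1)
sample_minimal(10, size = 10, prob = 1:10)  # large-prob items may repeat
```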

Re: [R] R object as a function

2008-01-23 Thread Tim Hesterberg
This function is a vectorized alternative to integrate:

CumIntegrate <- function(f, a = 0, b = 1, nintervals = 5, ...){
  # Cumulative integral.  f is a function, a and b are endpoints.
  # Return list(x, y) of two vectors,
  # where x = seq(a, b, length = nintervals + 1)
  # and   y = \int_a^{x} f(t) dt
  #
  # 10-point Gaussian quadrature on each of nintervals subintervals.
  # Assume that f can take a vector argument.
  ends <- seq(a, b, len = nintervals + 1)
  h <- (b - a)/(2 * nintervals)  # half-width of a single subinterval
  mids <- (ends[-1] + ends[1:nintervals])/2
  x <- c(0.14887433898163099, 0.43339539412924699, 0.67940956829902399,
         0.86506336668898498, 0.97390652851717197)
  wt <- c(0.29552422471475298, 0.26926671930999602, 0.21908636251598201,
          0.149451349150581, 0.066671344308687999)
  xvalues <- outer(h * c(-x, x), mids, "+")
  list(x = ends,
       y = c(0, h * cumsum(colSums(matrix(wt * f(c(xvalues), ...), 10)))))
}

Tim Hesterberg

On 22/01/2008 5:30 AM, Thomas Steiner wrote:
 I want to use a function as an argument to integrate it twice.
 ...

Duncan Murdoch wrote:
...
The other problem is that integrate is not vectorized, it can only take 
scalar values for lower and upper, so you'll need a loop within innerFkn 
to do the integral over the values being passed in.



Re: [R] Bootstrap Correlation Coefficient with Moving Block Bootstrap

2007-12-06 Thread Tim Hesterberg
It sounds like you should sample x and y together using the
block bootstrap.  If you have the usual situation, x and y in columns
and observations in rows, then sample blocks of rows.

Even though observations in y are independent, you would take
advantage of that only for bootstrapping statistics that depend only
on y.

The answer to your second question is the same as the first - sample
blocks of observations, keeping x and y together.
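A sketch of a simple moving block bootstrap of the correlation, keeping rows of (x, y) together; the data and block length l here are invented for illustration:

```r
set.seed(1)
n <- 100; l <- 5                               # series length, block length
x <- as.numeric(arima.sim(list(ar = 0.6), n))  # dependent series
y <- rnorm(n)                                  # independent series
dat <- cbind(x = x, y = y)

block_boot_cor <- function(dat, l) {
  n <- nrow(dat)
  starts <- sample(n - l + 1, ceiling(n / l), replace = TRUE)
  idx <- as.vector(outer(0:(l - 1), starts, "+"))[1:n]  # whole blocks of rows
  cor(dat[idx, "x"], dat[idx, "y"])
}
boot_cors <- replicate(2000, block_boot_cor(dat, l))
quantile(boot_cors, c(0.025, 0.975))  # bootstrap percentile interval
```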

Tim Hesterberg  

Hello.

I have got two problems in bootstrapping from
dependent data sets.

Given two time-series x and y. Both consisting of n
observations with x consisting of dependent and y
consisting of independent observations over time. Also
assume, that the optimal block-length l is given.

To obtain my bootstrap sample, I have to draw
pairwise, but there is the problem of dependence of
the x-observations and so if I draw the third
observation of y, I cannot simply draw the third
observation of x (to retain the serial correlation
structure between x and y), because I devided x into
blocks of length l and I have to draw blocks, then I
draw from x.

1.
How can I compute a bootstrap sample of the
correlation coefficient between x and y with respect
to the dependence in time-series of x?

2.
How does it look like, if x and y both consist of
dependent observations?



I hope you can help me. I got really stuck with this
problem.

Sincerly
Klein.



Re: [R] rowSums() and is.integer()

2007-11-20 Thread Tim Hesterberg
I wrote the original rowSums (in S-PLUS).
There, rowSums() does not coerce integer to double.

However, one advantage of coercion is to avoid integer overflow.

Tim Hesterberg

...  So, why does rowSums() coerce to double (behaviour
that is undesirable for me)?
