[R] How can I overwrite a method in R?
How can I create an improved version of a method in R, and have it be used?

Short version: I think plot.histogram has a bug, and I'd like to try a version with a fix. But when I call hist(), my fixed version doesn't get used.

Long version: hist() calls plot(), which calls plot.histogram(), which fails to pass ... when it calls plot.window(). As a result, hist() ignores the xaxs and yaxs arguments. I'd like to make my own copy of plot.histogram that passes ... to plot.window().

If I just make my own copy of plot.histogram, plot() ignores it, because my version is not part of the same graphics package that plot belongs to. If I copy hist, hist.default and plot, the copies inherit the same environments as the originals, and behave the same. If I also change the environment of each to .GlobalEnv, hist.default fails in a .Call because it cannot find C_BinCount.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
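A minimal way to reproduce the reported symptom (a sketch only; the variable x is arbitrary test data):

```r
# Demonstrating the reported bug: hist() accepts xaxs/yaxs, but per the
# report they are never passed on to plot.window(), so the axis style
# does not change from the default "r".
x <- rnorm(100)
hist(x, xaxs = "i", yaxs = "i")  # axis-style request silently ignored
```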
Re: [R] How can I overwrite a method in R?
Thank you Duncan, Brian, Hadley, and Lin.

In Lin's suggestion, I believe the latter two statements should be reversed, so that the environment is set before the function is placed into the graphics namespace:

source('plot.histogram.R')
environment(plot.histogram) <- asNamespace('graphics')
assignInNamespace('plot.histogram', plot.histogram, ns='graphics')

The middle statement could also be

environment(plot.histogram) <- environment(graphics:::plot.histogram)

The point is to ensure that the replacement version has the same environment as the original. Having tested this, I will now submit a bug report :-)

On Thu, Oct 9, 2014 at 9:11 AM, C Lin bac...@hotmail.com wrote:

I posted a similar question a while ago; search for "modify function in a package". In your case, the following should work:

source('plot.histogram.R')
assignInNamespace('plot.histogram', plot.histogram, ns='graphics')
environment(plot.histogram) <- asNamespace('graphics')

This assumes you have your own plot.histogram function inside plot.histogram.R, and that the plot.histogram function you are trying to overwrite is in the graphics package.

Lin

From: h.wick...@gmail.com
Date: Thu, 9 Oct 2014 07:00:31 -0500
To: timhesterb...@gmail.com
CC: r-h...@stat.math.ethz.ch
Subject: Re: [R] How can I overwrite a method in R?

This is usually ill-advised, but I think it's the right solution for your problem:

assignInNamespace("plot.histogram", function(...) plot(1:10), "graphics")
hist(1:10)

Hadley

On Thu, Oct 9, 2014 at 1:14 AM, Tim Hesterberg timhesterb...@gmail.com wrote:
(original question, quoted above)

--
Tim Hesterberg  http://www.timhesterberg.net
(resampling, water bottle rockets, computers to Costa Rica, hot shower = 2650 light bulbs, ...)
Help your students understand statistics: Mathematical Statistics with Resampling and R, Chihara & Hesterberg  http://www.timhesterberg.net/bootstrap/
Re: [R] boot() with glm/gnm on a contingency table
One approach is to bootstrap the vector 1:n, where n is the number of individuals, with a function that does:

f <- function(vectorOfIndices, theTable) {
  # (1) Create a new table with the same dimensions, but with the
  #     counts in the table based on vectorOfIndices.
  # (2) Calculate the statistics of interest on the new table.
}

When f is called with 1:n, the table it creates should be the same as the original table. When called with a bootstrap sample of values from 1:n, it should create a table corresponding to the bootstrap sample.

Tim Hesterberg  http://www.timhesterberg.net
(resampling, water bottle rockets, computers to Costa Rica, shower = 2650 light bulbs, ...)
NEW! Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8

Hi everyone! In a package I'm developing, I have created a custom function to get jackknife standard errors for the parameters of a gnm model (which is essentially the same as a glm model for this issue). I'd like to add support for the bootstrap using package boot, but I couldn't find how to proceed.

The problem is, my data is a table object. Thus, I don't have one individual per line: when the object is converted to a data frame, one row represents one cell, or one combination of factor levels. I cannot pass this to boot() as the data argument and use indices from my custom statistic() function, since I would drop cells, not individual observations.

A very inefficient solution would be to create a data frame with one row per observation, by replicating each cell using its frequencies. That's really a brute-force solution, though. ;-) The other way would be to generate importance weights based on observed frequencies, and to multiply the original data by the weights at each iteration, but I'm not sure that's correct. Thoughts? Thanks for your help!
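A sketch of the function f described above, under the assumption that theTable is an array of counts and that myStatistic is a hypothetical function computing the statistics of interest on a table:

```r
# Sketch: rebuild a table of counts from resampled individual indices,
# then compute the statistic on the rebuilt table.
f <- function(vectorOfIndices, theTable) {
  # cell number (1..length(theTable)) of each individual observation
  cellOfIndividual <- rep(seq_along(theTable), times = theTable)
  newTable <- theTable                      # keeps dim and dimnames
  newTable[] <- tabulate(cellOfIndividual[vectorOfIndices],
                         nbins = length(theTable))
  myStatistic(newTable)                     # hypothetical statistic function
}
```

As the post notes, f(1:sum(theTable), theTable) rebuilds exactly the original table, so the statistic for the original data is recovered as a check.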
Re: [R] boot() with glm/gnm on a contingency table
On Wednesday, September 12, 2012, at 07:08 -0700, Tim Hesterberg wrote:
(quoting the approach above: bootstrap the vector 1:n with a function f that rebuilds the table from the resampled indices and computes the statistics of interest on it.)

Indeed, that's another solution I considered, but I wanted to be sure nothing more reasonable exists. You're right that it's more efficient than replicating the whole data set. But still, with a typical table of fewer than 100 cells and several thousand observations, this means creating a potentially long vector, much larger than the original data; nothing really hard for common machines, to be sure. If no other way exists, I'll use this. Thanks.

In your original posting you also suggested: "The other way would be to generate importance weights based on observed frequencies, and to multiply the original data by the weights at each iteration, but I'm not sure that's correct. Thoughts?"

You could do:

bootstrapTable <- x  # where x is the original table
for (i in 1:numberOfBootstrapSamples) {
  bootstrapTable[] <- rmultinom(1, size = sum(x), prob = x)
  replicates[i] <- myFunction(bootstrapTable)
}
# caveat - not tested

I can't tell from help(boot) whether you could do it correctly there. boot has a 'weights' argument that you could use for the sampling probabilities, but you also need a way to tell it to draw sum(x) observations. Or, you could also pass boot a parametric sampler.
But be careful if you use boot in either of these ways; you not only need to generate the bootstrap samples, you also need to make sure that it does all other calculations correctly, including calculating the statistic for the original data, calculating jackknife statistics if they are used for confidence intervals, etc. Wistful sigh - this would be pretty easy to do with S+Resample.

Tim Hesterberg
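A hedged sketch of the last suggestion, passing boot a parametric sampler via sim = "parametric"; here x is the observed table and myFunction is the same hypothetical statistic-on-a-table as above:

```r
library(boot)
# Multinomial parametric bootstrap of a table (sketch; with
# sim = "parametric", boot calls the statistic on the simulated data
# directly, with no index vector).
res <- boot(data = x,
            statistic = myFunction,   # called as myFunction(simulatedTable)
            R = 999,
            sim = "parametric",
            ran.gen = function(tab, mle) {
              out <- tab
              out[] <- rmultinom(1, size = sum(tab), prob = tab)
              out
            },
            mle = NULL)
```

Note the caveat in the reply above still applies: with a hand-rolled sampler, confidence-interval machinery that relies on jackknife values (e.g. BCa) may not behave as expected.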
Re: [R] Histogram to KDE
To bootstrap from a histogram, use

sample(bins, replace = TRUE, prob = counts)

Note that a kernel density estimate is biased, so some bootstrap confidence intervals have poor coverage properties. Furthermore, if the kernel bandwidth is data-driven then the estimate is not functional, so some bootstrap and jackknife methods won't work right.

Tim Hesterberg  http://www.timhesterberg.net
New: Mathematical Statistics with Resampling and R, Chihara & Hesterberg

On Fri, Aug 31, 2012 at 12:15 PM, David L Carlson dcarl...@tamu.edu wrote:

Using a data.frame x with columns bins and counts:

x <- structure(list(bins = c(3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5,
                             11.5, 12.5, 13.5, 14.5, 15.5),
                    counts = c(1, 1, 2, 3, 6, 18, 19, 23, 8, 10, 6, 2, 1)),
               .Names = c("bins", "counts"), row.names = 4:16,
               class = "data.frame")

This will give you a plot of the kde estimate:

xkde <- density(with(x, rep(bins, counts)), bw = "SJ")
plot(xkde)

As for the standard error or the confidence interval, you would probably need to use bootstrapping.

Thanks. On a similar note - is there a way in R to directly resample / cross-validate from a histogram of a data-set without recreating the original data-set?

-----Original Message-----
Hello, I wanted to know if there was a way to convert a histogram of a data-set to a kernel density estimate directly in R. Specifically, I have a histogram [bins, counts] of samples {X1 ... XN} of a quantized variable X, where there is one bin for each level of X, and I'd like to directly get a kde estimate of the pdf of X from the histogram. Therefore, there is no additional quantization of X in the histogram. Most KDE methods in R seem to require the original sample set - and I would like to avoid re-creating the samples from the histogram. Is there some quick way of doing this using one of the standard kde methods in R? Also, a general statistical question - is there some measure of the standard error or confidence interval or similar of a KDE of a data-set?

Thanks, -fj
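Putting the two replies together - resampling straight from the histogram and re-estimating the density. This is a sketch using the data frame x defined above; taking the resample size as sum(counts) (i.e. the original N) is an assumption:

```r
# One bootstrap resample drawn directly from the histogram, then a KDE of it
boot.y <- with(x, sample(bins, size = sum(counts), replace = TRUE, prob = counts))
boot.kde <- density(boot.y, bw = "SJ")
plot(boot.kde)
```

Repeating this many times and collecting, say, the density at a fixed point gives the bootstrap distribution from which a standard error could be read off, subject to the bias caveats above.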
[R] [R-pkgs] dataframe package
A new 'dataframe' package is on CRAN. This is a modified version of the data frame code in R, changed to make fewer copies of the inputs and run faster. For example, as.data.frame(a vector) makes 1 copy of the vector rather than 3, data.frame(a vector) makes 3 copies rather than 6, and x[, "a"] <- newValue makes (2, 2, 1) copies of (old data frame, new column, integer vector of row names) rather than (6, 4, 2).

Functionality should be identical to existing R code, with two exceptions:
* bug fix: if x has two columns, then x[[4]] <- value will fail rather than producing an illegal data frame
* round(a data frame with numeric and factor columns) rounds the numeric columns and leaves the factor columns unchanged, rather than failing.

Tim Hesterberg
NEW! Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
http://home.comcast.net/~timhesterberg
(resampling, water bottle rockets, computers to Costa Rica, shower = 2650 light bulbs, ...)
Re: [R] Permutation or Bootstrap to obtain p-value for one sample
I'll concur with Peter Dalgaard that
* a permutation test is the right thing to do - your problem is equivalent to a two-sample test,
* don't bootstrap, and
* don't bother with t-statistics,
but I'll elaborate a bit on why, including
* two approaches to the whole problem - and how your approach relates to the usual approach,
* an interesting tidbit about resampling t statistics.

First, I'm assuming that your x variable is irrelevant, only y matters, and that sample1 is a proper subset of df. I'll also assume that you want to look for differences in the mean, rather than arbitrary differences (in which case you might use e.g. a Kolmogorov-Smirnov test).

There are two general approaches to this problem:
(1) two-sample problem, sample1$y vs df$y[rows other than sample1]
(2) the approach you outlined, thinking of sample1$y as part of df$y.

Consider (1), and call the two data sets y1 and y2. The basic permutation test approach is:
* compute the test statistic theta(y1, y2), e.g. mean(y1) - mean(y2)
* repeat 999 (or 9999) times:
    draw a sample of size n1 from the pooled data, call that Y1, call the rest Y2
    compute theta(Y1, Y2)
* the P-value for a one-sided test is (1 + k) / (1 + 999), where k is the number of permutation samples with theta(Y1, Y2) >= theta(y1, y2).

The test statistic could be:
  mean(y1) - mean(y2)
  mean(y1)
  sum(y1)
  t-statistic (pooled variance)
  P-value for a t-test (pooled variance)
  mean(y1) - mean(pooled data)
  t-statistic (unpooled variance)
  P-value for a t-test (unpooled variance)
  median(y1) - median(y2)
  ...

The first six of those are equivalent - they give exactly the same P-value for the permutation test. The reason is that those test statistics are monotone transformations of each other, given the data. Hence, doing the pooled-variance t calculations gains nothing.

Now consider your approach (2). That is equivalent to the permutation test using the test statistic mean(y1) - mean(pooled data).

Why not a bootstrap? E.g. pool the data and draw samples of size n1 and n2 from the pooled data, independently and with replacement. That is similar to the permutation test, but less accurate. Probably the easiest way to see this is to suppose there is 1 outlier in the pooled data. In any permutation iteration there is exactly 1 outlier among the two samples. With bootstrapping, there could be 0, 1, 2, ... The permutation test answers the question: given that there is exactly 1 outlier in my combined data, what is the probability that random chance would give a difference as large as I observed? The bootstrap would answer some other question.

Tim Hesterberg
NEW! Mathematical Statistics with Resampling and R, Chihara & Hesterberg
http://www.amazon.com/Mathematical-Statistics-Resampling-Laura-Chihara/dp/1118029852/ref=sr_1_1?ie=UTF8
http://home.comcast.net/~timhesterberg
(resampling, water bottle rockets, computers to Guatemala, shower = 2650 light bulbs, ...)

On Oct 8, 2011, at 16:04, francy wrote:

Hi, I am having trouble understanding how to approach a simulation: I have a sample of n=250 from a population of N=2,000 individuals, and I would like to use either a permutation test or the bootstrap to test whether this particular sample is significantly different from the values of any other random samples of the same population. I thought I needed to take random samples (but I am not sure how many simulations I need to do) of n=250 from the N=2,000 population and maybe do a one-sample t-test to compare the mean score of all the simulated samples, plus the one sample I am trying to prove is different from any others, to the mean value of the population. But I don't know: (1) whether this one-sample t-test would be the right way to do it, and how to go about doing this in R; (2) whether permutation test or bootstrap methods are more appropriate.

This is the data frame that I have, which is to be sampled (2,000 rows with columns x and y):

df <- ...

I have this sample from df (250 rows), and would like to test whether it has extreme values of y:

sample1 <- ...

For now I only have this:

R <- 999  # number of simulations, but I don't know how many...
t.values <- numeric(R)  # holds the result of each simulation
for (i in 1:R) {
  sample1 <- df[sample(nrow(df), 250, replace=TRUE), ]

But I don't know how to continue the loop: do I calculate the mean for each simulation and compare it to the population mean? Any help you could give me would be very appreciated, thank you.

The straightforward way would be a permutation test, something like this:

msamp <- mean(sample1$y)
mpop <- mean(df$y)
msim <- replicate(10000, mean(sample(df$y, 250)))
sum(abs(msim - mpop) >= abs(msamp - mpop)) / 10000

I don't really see a reason to do bootstrapping here. You say you want to test whether your sample could be a random sample from the population, so just simulate that sampling.
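The two-sample recipe above, written as a small self-contained function (a sketch; theta here is mean(y1) - mean(y2), and the test is one-sided):

```r
# Two-sample permutation test: permute group labels by drawing n1 indices
# without replacement from the pooled data.
perm.test <- function(y1, y2, R = 9999) {
  pooled <- c(y1, y2)
  n1 <- length(y1)
  observed <- mean(y1) - mean(y2)
  perm <- replicate(R, {
    i <- sample(length(pooled), n1)       # one random relabeling
    mean(pooled[i]) - mean(pooled[-i])
  })
  (1 + sum(perm >= observed)) / (1 + R)   # one-sided P-value
}
```

By the equivalence argument above, substituting mean(y1), sum(y1), or a pooled-variance t-statistic for the difference in means would give exactly the same P-value.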
Re: [R] Extracting components from a 'boot' class output in R
Sarath pointed out (privately) that the standard error is not actually stored as part of the bootstrap object. Instead, it is calculated on the fly when you print the bootstrap object; this is done by boot:::print.boot. Similarly for bias. (The 'boot:::' is necessary because 'print.boot' is not an exported object from 'namespace:boot'.)

Tim Hesterberg

Do names(bootObj) to find out what the components are, and use $ or [[ to extract components. Do help(boot) for a description of the components of the object (look in the Value section). That is general advice in R, applying to all kinds of objects - boot, and many other functions such as lm(), return lists with a class added, and you can operate on the object as a list using names(), $, etc.

Tim Hesterberg

Dear R user, I used the following to do a bootstrap:

bootObj <- boot(data=DAT, statistic=Lp.est, R=1000, x0=3)

I have the following output from the above bootstrap. How can I extract components of the output? For example, how can I extract the std.error?

> bootObj
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

Bootstrap Statistics :
    original   bias    std. error
t1* 794.9745   -0.341    4.042099

Any help is greatly appreciated. Thank you, Sarath Banneheka
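Since the printed values are computed on the fly, they can be recomputed from the stored replicates: the bootstrap replicates are in bootObj$t (an R-by-k matrix) and the original statistic in bootObj$t0. A sketch:

```r
# Recompute what print.boot displays, from the stored components
boot.se   <- apply(bootObj$t, 2, sd)            # the printed "std. error"
boot.bias <- colMeans(bootObj$t) - bootObj$t0   # the printed "bias"
```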
Re: [R] Extracting components from a 'boot' class output in R
Do names(bootObj) to find out what the components are, and use $ or [[ to extract components. Do help(boot) for a description of the components of the object (look in the Value section). That is general advice in R, applying to all kinds of objects - boot, and many other functions such as lm(), return lists with a class added, and you can operate on the object as a list using names(), $, etc.

Tim Hesterberg

Dear R user, I used the following to do a bootstrap:

bootObj <- boot(data=DAT, statistic=Lp.est, R=1000, x0=3)

I have the following output from the above bootstrap. How can I extract components of the output? For example, how can I extract the std.error?

> bootObj
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = DAT, statistic = Lp.est, R = 1000, x0 = 3)

Bootstrap Statistics :
    original   bias    std. error
t1* 794.9745   -0.341    4.042099

Any help is greatly appreciated. Thank you, Sarath Banneheka
Re: [R] Question on approximations of full logistic regression model
My usual rule is that whatever gives the widest confidence intervals in a particular problem is most accurate for that problem :-)

Bootstrap percentile intervals tend to be too narrow. Consider the case of the sample mean; the usual formula CI is

xbar +- t_alpha sqrt((1/(n-1)) sum((x_i - xbar)^2)) / sqrt(n)

The bootstrap percentile interval for symmetric data is roughly

xbar +- z_alpha sqrt((1/n) sum((x_i - xbar)^2)) / sqrt(n)

It is narrower than the formula CI because
* z quantiles rather than t quantiles
* the standard error uses a divisor of n rather than (n-1).

In stratified sampling, the narrowness factor depends on the stratum sizes, not the overall n. In regression, estimates for some quantities may be based on a small subset of the data (e.g. coefficients related to rare factor levels). This doesn't mean we should give up on the bootstrap. There are remedies for the bootstrap biases, see e.g. Hesterberg, Tim C. (2004), "Unbiasing the Bootstrap - Bootknife Sampling vs. Smoothing", Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930. http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf And other methods have their own biases, particularly in nonlinear applications such as logistic regression.

Tim Hesterberg

Thank you for your reply, Prof. Harrell. I agree with you. Dropping only one variable does not actually help a lot. I have one more question. During analysis of this model I found that the confidence intervals (CIs) of some coefficients provided by bootstrapping (the bootcov function in the rms package) were narrower than the CIs provided by the usual variance-covariance matrix, and the CIs of other coefficients wider. My data has no cluster structure. I am wondering which CIs are better. I guess the bootstrap ones, but is that right? I would appreciate your help. -- KH

(11/05/16 12:25), Frank Harrell wrote:
I think you are doing this correctly except for one thing. The validation and other inferential calculations should be done on the full model. Use the approximate model to get a simpler nomogram but not to get standard errors. With only dropping one variable you might consider just running the nomogram on the entire model. Frank

KH wrote:
Hi, I am trying to construct a logistic regression model from my data (104 patients and 25 events). I built a full model consisting of five predictors, with penalization via the rms package (lrm, pentrace, etc.) because of the events-per-variable issue. Then I tried to approximate the full model by a step-down technique, predicting L from all of the component variables using ordinary least squares (ols in the rms package), as follows. I would like to know whether I am doing this right or not.

library(rms)
plogit <- predict(full.model)
full.ols <- ols(plogit ~ stenosis+x1+x2+ClinicalScore+procedure, sigma=1)
fastbw(full.ols, aics=1e10)

 Deleted       Chi-Sq d.f. P      Residual d.f. P      AIC    R2
 stenosis       1.41  1    0.2354   1.41   1    0.2354  -0.59 0.991
 x2            16.78  1    0.0000  18.19   2    0.0001  14.19 0.882
 procedure     26.12  1    0.0000  44.31   3    0.0000  38.31 0.711
 ClinicalScore 25.75  1    0.0000  70.06   4    0.0000  62.06 0.544
 x1            83.42  1    0.0000 153.49   5    0.0000 143.49 0.000

Then I fitted an approximation to the full model using the most important variables (dropping a variable only while R^2 for predictions from the reduced model against the original Y stays above 0.95), that is, dropping stenosis:

full.ols.approx <- ols(plogit ~ x1+x2+ClinicalScore+procedure)
full.ols.approx$stats
          n  Model L.R.       d.f.        R2         g      Sigma
104.0000000 487.9006640  4.0000000 0.9908257 1.3341718  0.1192622

This approximate model had R^2 against the full model of 0.99. Therefore, I updated the original full logistic model, dropping stenosis as a predictor:

full.approx.lrm <- update(full.model, ~ . -stenosis)

validate(full.model, bw=F, B=1000)
          index.orig training   test optimism index.corrected    n
Dxy           0.6425   0.7017 0.6131   0.0887          0.5539 1000
R2            0.3270   0.3716 0.3335   0.0382          0.2888 1000
Intercept     0.0000   0.0000 0.0821  -0.0821          0.0821 1000
Slope         1.0000   1.0000 1.0548  -0.0548          1.0548 1000
Emax          0.0000   0.0000 0.0263   0.0263          0.0263 1000

validate(full.approx.lrm, bw=F, B=1000)
          index.orig training   test optimism index.corrected    n
Dxy           0.6446   0.6891 0.6265   0.0626          0.5820 1000
R2            0.3245   0.3592 0.3428   0.0164          0.3081 1000
Intercept     0.0000   0.0000 0.1281  -0.1281          0.1281 1000
Slope         1.0000   1.0000 1.1104  -0.1104          1.1104 1000
Emax          0.0000   0.0000 0.0444   0.0444          0.0444 1000

Validation revealed this approximation was not bad. Then, I made a nomogram:

full.approx.lrm.nom <- nomogram
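The two narrowness factors mentioned above (z vs t quantiles, and the n vs n-1 divisor in the standard error) are easy to quantify; a quick illustration for n = 20:

```r
# How much narrower is the percentile interval, roughly, for n = 20?
n <- 20
qt(0.975, n - 1) / qnorm(0.975)  # t vs z quantile ratio, > 1
sqrt(n / (n - 1))                # sd-divisor ratio, > 1
```

Both ratios exceed 1, and their product gives the approximate factor by which the formula CI is wider than the percentile interval for symmetric data.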
Re: [R] memory and bootstrapping
(Regarding bootstrapping logistic regression.)

If the number of rows with Y=1 is small, it doesn't matter that n is huge. If both the number of successes and the number of failures are huge, then as Ripley notes you can use asymptotic CIs. The mean difference in predicted probabilities is a nonlinear function of the coefficients, so you can use the delta method to get standard errors.

In general, if you're not sure about normality and bias, you can use the bootstrap to estimate how close to normal the sampling distribution is. The results may surprise you. For example, for a one-sample mean, if the population has skewness = 2 (like an exponential distribution), you need n=5000 before the CLT is reasonably accurate (actual one-sided non-coverage probabilities within 10% of the nominal, for a 95% interval).

Finally, you can speed up bootstrapping glms by using starting values based on the coefficients estimated from the original data. And you can compute the model matrix once and resample rows of that along with y, rather than computing a model matrix from scratch each time.

Tim Hesterberg

The only reason the boot package will take more memory for 2000 replications than 10 is that it needs to store the results. That is not to say that on a 32-bit OS the fragmentation will not get worse, but that is unlikely to be a significant factor. As for the methodology: 'boot' is support software for a book, so please consult it (and not secondary sources). From your brief description it looks to me as if you should be using studentized CIs. 130,000 cases is a lot, and running the experiment on a 1% sample may well show that asymptotic CIs are good enough.

On Thu, 5 May 2011, E Hofstadler wrote:

Hello, the following questions will without doubt reveal some fundamental ignorance, but hopefully you can still help me out. I'd like to bootstrap a coefficient gained on the basis of the coefficients in a logistic regression model (the mean difference in the predicted probabilities between two groups, where each predict() operation uses as the newdata argument a data frame of equal size as the original data frame). I've got 130,000 rows and 7 columns in my data frame. The glm model uses all variables (as well as two 2-way interactions).

System:
- R version: 2.12.2
- OS: Windows XP Pro, 32-bit
- 3.16 GHz Intel dual-core processor, 2.9 GB RAM

I'm using the boot package to arrive at the standard errors for this difference, but even with only 10 replications this takes quite a long time: 216 seconds (perhaps this is partly due to my inefficiently programmed function underlying the boot call; I'm also looking into that). I wanted to try out calculating a BCa bootstrap confidence interval, which as I understand requires many more replications than normal-theory intervals. Drawing on John Fox's appendix to his "An R Companion to Applied Regression", I was thinking of trying 2000 replications -- but this will take several hours to compute on my system (which isn't in itself a major issue, though).

My questions:
- Let's say I try bootstrapping with 2000 replications. Can I be certain that the memory available to R will be sufficient for this operation?
- (This relates to statistics more generally:) Is it a good idea in your opinion to try BCa bootstrapping, or can it be assumed that a normal-theory confidence interval will be a sufficiently good approximation (letting me get away with, say, 500 replications)?

Best, Esther

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax: +44 1865 272595
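A sketch of the two glm speedups suggested above - computing the model matrix once and warm-starting each refit from the original coefficients. The data frame dat and the formula are placeholders:

```r
# Speedups for bootstrapping a glm: fixed model matrix + warm starts
fit0  <- glm(y ~ x1 + x2, data = dat, family = binomial)
X     <- model.matrix(fit0)
yy    <- fit0$y
beta0 <- coef(fit0)

one.rep <- function() {
  i <- sample(nrow(X), replace = TRUE)   # resample rows of X and y together
  coef(glm.fit(X[i, ], yy[i], family = binomial(), start = beta0))
}
boot.coefs <- t(replicate(999, one.rep()))
```

Because each resample's coefficients are close to beta0, the IRLS iterations inside glm.fit typically converge in fewer steps than a cold start would need.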
Re: [R] Bootstrap 95% confidence intervals for splines
You're mixing up two concepts here:
- splines
- bootstrap confidence intervals
Separating them may help cut the confusion.

First, to do a bootstrap confidence interval for a difference in predictions in the linear regression case, do:

repeat 10^4 times
    draw a bootstrap sample of the observations (subjects, keeping x and y together)
    fit the linear regression to the bootstrap sample
    record the difference in predictions at the two x values
end loop

The bootstrap confidence interval is the range of the middle 95% of the recorded differences.

For a spline, the procedure is the same except for fitting a spline regression:

repeat 10^4 times
    draw a bootstrap sample of the observations (subjects, keeping x and y together)
    fit the SPLINE regression to the bootstrap sample
    record the difference in predictions at the two x values
end loop

The bootstrap confidence interval is the range of the middle 95% of the recorded differences.

Tim Hesterberg

P.S. I think you're mixing up the response and explanatory variables. I'd think of eating hot dogs as the cause (explanatory variable), and waistline as the effect (response, or outcome).

P.P.S. I don't like the terms "independent" and "dependent" variables, as that conflicts with the concept of independence in probability. Independent variables are generally not independent, and the dependent variable may be independent of the others.

There appear to be reports in the literature that transform continuous independent variables by the use of splines. E.g., assume the dependent variable is hot dogs eaten per week (HD) and the independent variable is waistline (WL); a normal linear regression model would be:

nonconfusing_regression <- lm(HD ~ WL)

One might use a spline:

confusion_inducing_regression_with_spline <- lm(HD ~ ns(WL, df = 4))

Now is where the problem starts. From nonconfusing_regression, I get, say, 2 added hot dogs per week for each centimeter of waistline, along with an s.e. of 0.5 hot dogs per week, which I multiply by 1.96 to garner each side of the 95% c.i. If I want to show what the difference between the 75th percentile (say 100 cm) and 25th percentile (say 80 cm) waistlines is, I multiply 2 by 100-80=20 and get 40 hot dogs per week as the point estimate, with a similar bumping of the s.e. to 10 hot dogs per week. What do I do to get the point estimate and 95% confidence interval for the difference between 100 cm persons and 80 cm persons with confusion_inducing_regression_with_spline?

Best regards, Mitchell S. Wachtel, MD
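The spline loop above, written out in R. This is a sketch assuming a data frame dat with columns WL and HD, and comparing predictions at WL = 100 vs WL = 80 as in the question:

```r
library(splines)
# Bootstrap CI for the difference in spline predictions at two x values
diffs <- replicate(10^4, {
  b   <- dat[sample(nrow(dat), replace = TRUE), ]   # resample subjects
  fit <- lm(HD ~ ns(WL, df = 4), data = b)
  p   <- predict(fit, newdata = data.frame(WL = c(100, 80)))
  p[1] - p[2]                                        # difference in predictions
})
quantile(diffs, c(0.025, 0.975))  # middle 95% = percentile interval
```

The same code with lm(HD ~ WL, data = b) gives the linear-regression version of the interval for comparison.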
Re: [R] jackknife-after-bootstrap
Can someone help me with detection of outliers using the jackknife-after-bootstrap algorithm?

A simple procedure is to calculate the mean of the bootstrap statistics for all bootstrap samples that omit the first of the original observations; repeat for the second, third, ... original observation. You now have n means, and can look at these for outliers.

A similar approach is to calculate means of bootstrap statistics for samples that include (rather than omit) each of the original observations.

Both of those approaches can suffer from considerable random variability. Provided the number of bootstrap samples is large, a better approach is to use linear regression, where
  y = the vector of bootstrap statistics, length R
  X = an R by n matrix, with X[i, j] = the number of times original observation j is included in bootstrap sample i,
and without an intercept. The n regression coefficients give estimates of the influence of the original observations, and you can look for outliers in these influence estimates. For comparison, the first simple procedure above corresponds to taking averages of y for rows with X[, j] = 0, and the similar approach to averaging y for rows with X[, j] > 0.

For further discussion see:
Hesterberg, Tim C. (1995), "Tail-Specific Linear Approximations for Efficient Bootstrap Simulations", Journal of Computational and Graphical Statistics, 4(2), 113-133.
Hesterberg, Tim C. and Stephen J. Ellis (1999), "Linear Approximations for Functional Statistics in Large-Sample Applications", Technical Report No. 86, Research Department, MathSoft, Inc., 1700 Westlake Ave. N., Suite 500, Seattle, WA 98109. http://home.comcast.net/~timhesterberg/articles/tech86-linear.pdf
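The regression approach above, sketched on simulated data, with the mean as the statistic so the results can be checked against the known influence of each observation:

```r
# Jackknife-after-bootstrap influence estimates via regression
set.seed(1)
n <- 20; R <- 2000
y0 <- rexp(n)                                    # original sample
idx <- matrix(sample(n, n * R, replace = TRUE), nrow = R)
theta <- apply(idx, 1, function(i) mean(y0[i]))  # bootstrap statistics, length R
X <- t(apply(idx, 1, tabulate, nbins = n))       # X[i, j] = inclusion count
infl <- coef(lm(theta ~ X - 1))                  # influence estimates, no intercept
# For the mean, infl[j] recovers y0[j]/n; look for outliers among infl.
```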
Re: [R] How to do bootstrap for the complex sample design?
Faye wrote: Our survey is structured as: the area to be investigated is divided into 6 regions; within each region, one urban community and one rural community are randomly selected, then samples are randomly drawn from each selected urban and rural community. The problem is that in each urban/rural stratum, we only have one sample. In this case, how do we do the bootstrap? You are lucky that your sample size is 1. If it were 2 you would probably have proceeded without realizing that the answers were wrong. Suppose you had two samples in each stratum. If you proceed naturally, drawing bootstrap samples of size 2 from each stratum, this would underestimate variability by a factor of 2. In general the ordinary nonparametric bootstrap estimates of variability are biased downward by a factor of (n-1)/n -- exactly for the mean, approximately for other statistics. In multiple-sample and stratified situations, the bias depends on the stratum sizes. Three remedies are:
* draw bootstrap samples of size n-1
* bootknife sampling - omit one observation (a jackknife sample), then draw a bootstrap sample of size n from that
* bootstrap from a kernel density estimate, with kernel covariance equal to empirical covariance (with divisor n-1) / n.
The latter two are described in Hesterberg, Tim C. (2004), "Unbiasing the Bootstrap - Bootknife Sampling vs. Smoothing", Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930. http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf All three are undefined for samples of size 1. You need to go to some other bootstrap, e.g. a parametric bootstrap with variability estimated from other data. Tim Hesterberg
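The bootknife remedy above can be sketched as follows, using made-up data; for the sample mean the resulting variance should be close to the unbiased var(x)/n rather than the downward-biased (n-1)/n times that:

```r
# Bootknife sketch: omit one observation at random (a jackknife sample),
# then draw a bootstrap sample of size n from the remaining n - 1 values.
set.seed(1)
bootknife1 <- function(x) {
  n <- length(x)
  jack <- x[-sample(n, 1)]              # jackknife sample of size n - 1
  sample(jack, size = n, replace = TRUE)
}
reps <- replicate(10^4, mean(bootknife1(1:10)))
var(reps)  # near var(1:10)/10, the unbiased variance of the sample mean
```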
Re: [R] first post and bootstrapping problems
I have used R for a year now and am dealing with geometric morphometrics of deer skulls. Yes, I am a biologist and my math skills are just beginning to brush up. To cut to the chase... I have two groups in my data (males and females) and my data is in simple vector form. Now I need a bootstrap test for this value
szc1 <- ((mean(maleCent)-mean(femaCent))^2)/(var(maleCent)+var(femaCent))
which concerns the two aforementioned groups. I have 39 males and 11 females, totaling 50 individuals. Now I don't know how to set this up with the boot() function. Any ideas?
You should use a two-sample permutation test rather than a bootstrap. The basic idea is:
compute the statistic for the original data
compute the statistic for many permutations of the data
compute upper and lower P-values using (1 + number of permutation statistics at least as extreme as the original) / (r+1)
two-sided P-value = 2 * smaller of the one-sided P-values
Here is example code:
maleCent <- rnorm(39) # example data
femaCent <- rnorm(11)
xBoth <- c(maleCent, femaCent)
statistic <- function(x){
  x1 <- x[1:39]
  x2 <- x[40:50]
  (mean(x1)-mean(x2))^2 / (var(x1) + var(x2))
}
theta <- statistic(xBoth)
r <- 10^5-1
permutationDistribution <- numeric(r)
for(i in 1:r) {
  permutationDistribution[i] <- statistic(sample(xBoth, size=50, replace=FALSE))
}
pValueLower <- (1 + sum(permutationDistribution <= theta)) / (r+1)
pValueUpper <- (1 + sum(permutationDistribution >= theta)) / (r+1)
pValueTwosided <- 2 * min(pValueLower, pValueUpper)
Tim Hesterberg
[R] make methods work in lapply - remove lapply's environment
I've defined my own version of summary.default, that gives a better summary for highly skewed vectors. If I call summary(x) the method is used. If I call summary(data.frame(x)) the method is not used. I've traced this to lapply; this uses the new method:
lapply(list(x), function(x) summary(x))
and this does not:
lapply(list(x), summary)
If I make a copy of lapply, WITHOUT the environment, then the method is used.
lapply <- function (X, FUN, ...)
{
  FUN <- match.fun(FUN)
  if (!is.vector(X) || is.object(X))
    X <- as.list(X)
  .Internal(lapply(X, FUN))
}
I'm curious to hear reactions to this. There is a March 2006 thread "object size vs. file size" in which Duncan Murdoch wrote: Functions in R consist of 3 parts: the formals, the body, and the environment. You can't remove any part, but you can change it. That is exactly what I want to do, remove the environment, so that when I define a better version of some function, the better version is used. Here's a function to automate the process:
copyFunction <- function(Name){
  # Copy a function, without its environment.
  # Name should be quoted.
  # Return the copy.
  file <- tempfile()
  on.exit(unlink(file))
  dput(get(Name), file = file)
  f <- source(file)$value
  f
}
lapply <- copyFunction("lapply")
Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?
I'll address the question of whether you can use the bootstrap to improve estimates, and whether you can use the bootstrap to virtually increase the size of the sample. Short answer - no, with some exceptions (bumping / random forests). Longer answer: Suppose you have data (x1, ..., xn) and a statistic ThetaHat, that you take a number of bootstrap samples (all of size n), and let ThetaHatBar be the average of the bootstrap statistics from those samples. Is ThetaHatBar better than ThetaHat? Usually not. Usually it is worse. You have not collected any new data; you are just using the existing data in a different way, which is usually harmful:
* If the statistic is the sample mean, all this does is add some noise to the estimate.
* If the statistic is nonlinear, this gives an estimate that has roughly double the bias, without improving the variance.
What are the exceptions? The prime example is tree models (random forests) - taking bootstrap averages helps smooth out the discontinuities in tree models. For a simple example, suppose that a simple linear regression model really holds: y = beta x + epsilon, but that you fit a tree model; the tree model predictions are a step function. If you bootstrap the data, the boundaries of the step function will differ from one sample to another, so the average over the bootstrap samples smears out the steps, getting closer to the smooth linear relationship. Aside from such exceptions, the bootstrap is used for inference (bias, standard error, confidence intervals), not improving on ThetaHat. Tim Hesterberg
Hi Doran, Maybe I am wrong, but I think the bootstrap is a general resampling method which can be used for different purposes... Usually it works well when you do not have a representative sample set (maybe with a limited number of samples). Therefore, I am positive with Michal...
P.S., overfitting, in my opinion, describes when you have a model which is quite specific to the training dataset but cannot be generalized to new samples. Thanks, --Jerry 2008/7/21 Doran, Harold [EMAIL PROTECTED]: I used bootstrap to virtually increase the size of my dataset, it should result in estimates more close to that from the population - isn't it the purpose of bootstrap? No, not really. The bootstrap is a resampling method for variance estimation. It is often used when there is not an easy way, or a closed form expression, for estimating the sampling variance of a statistic.
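A small illustration of why averaging bootstrap replicates does not improve a nonlinear statistic: for exp(mean(x)), the average of bootstrap replicates carries roughly double the bias of the plain estimate. The data and sizes below are made up for illustration:

```r
# Compare the plain estimate exp(mean(x)) with the average of its
# bootstrap replicates, over many simulated datasets. Illustration only.
set.seed(1)
one.run <- function(n = 25, B = 200) {
  x <- rnorm(n)
  thetaHat <- exp(mean(x))
  thetaBar <- mean(replicate(B, exp(mean(sample(x, replace = TRUE)))))
  c(thetaHat = thetaHat, thetaBar = thetaBar)
}
est <- replicate(500, one.run())
rowMeans(est) - exp(0)  # biases of the two estimates relative to exp(true mean)
```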
Re: [R] sample size in bootstrap(boot)
bootstrap() and samp.bootstrap() are part of the S+Resample package, see http://www.insightful.com/downloads/libraries You could modify boot() to allow sampling with size other than n. Use caution when bootstrapping with a sample size other than n. The usual reason for bootstrapping is inference (standard errors, confidence intervals) using the actual data, including the actual sample size, not some other data that you don't have. However, there are reasons to sample with other sample sizes, e.g.:
* Planning for future work, e.g. planning for a clinical trial with large n based on current sample data with small n. You may want to try different n, to see how that would affect standard errors or normality of sampling distributions.
* Better accuracy. Bootstrap standard errors are biased downward, corresponding to computing the usual sample standard deviation using a divisor of n instead of (n-1). Bootstrap distributions tend to be too narrow. One remedy is to sample with size (n-1). For others see: Hesterberg, Tim C. (2004), "Unbiasing the Bootstrap - Bootknife Sampling vs. Smoothing", Proceedings of the Section on Statistics and the Environment, American Statistical Association, 2924-2930. http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
Tim Hesterberg (formerly of Insightful, now Google, and only now catching up on R-help)
Hi Dan, Thanks for the response. Yes, I do know that the bootstrap samples generated by the function boot are the same size as the original dataset, but somewhere in the R-help threads I saw a suggestion that one can control the sample size (n) using the following command (please see below). My problem is that it doesn't work; it gives an error (error in: n * nboot : non-numeric argument to binary operator):
bootstrap(data, statistic, sampler = samp.bootstrap(size=20))
This is what somebody on R-help suggested... can we fix that error somehow?
On Wed, 26 Mar 2008 08:26:22 -0700 Nordlund, Dan (DSHS/RDA) wrote: -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Zaihra T Sent: Wednesday, March 26, 2008 7:57 AM To: Jan T. Kim; R-help@r-project.org Subject: [R] sample size in bootstrap(boot) Hi, Can someone tell me how to control sample size (n) in the bootstrap function boot in R. Can we give some option like we give for # of repeated samples (R=say 100). Will appreciate any help. thanks I don't believe so. Isn't one of the differences between the bootstrap and other kinds of resampling that the bootstrap samples with replacement a sample of the same size as the original data? You could use the function sample() to select your subsets and compute your statistics of interest. Hope this is helpful, Dan Daniel J. Nordlund Research and Data Analysis Washington State Department of Social and Health Services Olympia, WA 98504-5204
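The size-(n-1) remedy mentioned above is easy to sketch with base sample(), since boot() itself fixes the resample size at n; the data here are made up:

```r
# Bootstrap with resample size n - 1 instead of n. For the sample mean,
# the resulting variance is close to the unbiased var(x)/n, where the
# ordinary size-n bootstrap is biased downward by (n-1)/n.
set.seed(1)
x <- rexp(30)
n <- length(x)
boot.n1 <- replicate(10^4, mean(sample(x, size = n - 1, replace = TRUE)))
var(boot.n1)  # close to var(x)/n, the unbiased variance of the sample mean
```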
Re: [R] applying a function to data frame columns
You can do:
lapply2(u, v, function(u, v) u[inRange(u, range(v))])
using two functions 'lapply2' and 'inRange' defined at bottom. This basically does:
lapply(seq(along=u), function(i, U, V){
  u <- U[[i]]
  v <- V[[i]]
  u[u >= range(v)[1] & u <= range(v)[2]]
}, U = u, V = v)
Tim Hesterberg
I want to apply this function to the columns of a data frame: u[u >= range(v)[1] & u <= range(v)[2]] where u is the n-column data frame under consideration and v is a data frame of values with the same number of columns as u. For example,
v1 <- c(1,2,3)
v2 <- c(3,4,5)
v3 <- c(2,3,4)
v <- as.data.frame(cbind(v1,v2,v3))
uk1 <- seq(min(v1) - .5, max(v1) + .5, .5)
uk2 <- seq(min(v2) - .5, max(v2) + .5, .5)
uk3 <- seq(min(v3) - .5, max(v3) + .5, .5)
u <- do.call(expand.grid, list(uk1,uk2,uk3))
Here, there are 3 columns; instead of hard-coding this, can the function given above, which will restrict the u data frame to values within the ranges of each variable, be done with the apply function? Thanks in advance. dxc13
# inRange requires ifelse1, part of the splus2R package.
inRange <- function(x, a, b, strict = FALSE) {
  # Return TRUE where x is within the range of a to b.
  # If a is length 2 and b is missing, assume that a gives the range.
  # If strict==FALSE, then allow equality; otherwise require a < x < b.
  # strict may be a vector of length 2, governing the two ends.
  if(length(a) == 2) {
    b <- a[2]
    a <- a[1]
  }
  else if(length(a) * length(b) != 1)
    stop("a and b must both have length 1, or a may have length 2")
  strict <- rep(strict, length = 2)
  ifelse1(strict[1], x > a, x >= a) & ifelse1(strict[2], x < b, x <= b)
}
lapply2 <- function(X1, X2, FUN, ...){
  # Like lapply, but for two inputs.
  # FUN should take two inputs, one from X1 and one from X2.
  n1 <- length(X1)
  if(n1 != length(X2)) stop("X1 and X2 have different lengths")
  if(is.character(FUN)) FUN <- getFunction(FUN)
  else if(!is.function(FUN)) {
    farg <- substitute(FUN)
    if(is.name(farg)) FUN <- getFunction(farg)
    else stop("'", deparseText(farg), "' is not a function")
  }
  FUNi <- function(i, X1, X2, FUN2, ...) FUN2(X1[[i]], X2[[i]], ...)
  # Create sequence vector.
  # If objects have same names, use them.
  i <- seq(length = n1)
  names1 <- names(X1)
  if(length(names1) && identical(names1, names(X2))) names(i) <- names1
  # Final result; loop over the sequence vector
  lapply(i, FUNi, X1 = X1, X2 = X2, FUN2 = FUN, ...)
}
Re: [R] Does the t.test in R use the Welch procedure or an ordinary student t-test?
First, a clarification. The subject line suggests that the Welch procedure is not an ordinary student t-test. That is incorrect. There are two common two-sample t-tests: non-pooled variance (Welch version) and pooled variance. I would refer to the non-pooled version just as a two-sample t-test; if you want to be definite, you could refer to Welch or non-pooled. The other common t-test is the pooled-variance t-test. In the old days, the pooled-variance t-test was considered standard, but statistics has shifted in favor of the non-pooled test being the standard; someone using the pooled-variance version should note the choice, and justify it (at least to him- or herself). Also note that an F-test is often a poor way to justify pooling, because the F-test is not robust against non-normality. "To make a preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port." (G.E.P. Box, "Non-normality and tests on variances", Biometrika, 40 (1953), pp 318-335, quote on page 333; via Moore & McCabe.) Tim Hesterberg
Dear all, I have run t.test(), and get output similar to this:
t.test(extra ~ group, data = sleep)
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.3654832 0.2054832
sample estimates: mean in group 1 mean in group 2
0.75 2.33
Should this be referred to as a Welch procedure or an ordinary student t-test? Regards Kes
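For comparison, the only change needed to get the pooled-variance test from t.test() is var.equal = TRUE; the default gives the Welch (non-pooled) test shown in the output above:

```r
# Welch (default) vs pooled-variance two-sample t-test on the same data.
welch  <- t.test(extra ~ group, data = sleep)
pooled <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
unname(welch$parameter)   # 17.776..., estimated (Welch-Satterthwaite) df
unname(pooled$parameter)  # 18, exactly n1 + n2 - 2
```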
[R] use rowSums or colSums instead of apply!
There were two queries recently regarding removing rows or columns that have all NAs. Three respondents suggested combinations of apply() with any() or all(). I cringe when I see apply() used unnecessarily. Using rowSums() or colSums() is much faster, and gives more readable code. (Two respondents did suggest colSums for the second query.)
# original small data frame
df <- data.frame(col1=c(1:3,NA,NA,4), col2=c(7:9,NA,NA,NA), col3=c(2:4,NA,NA,4))
system.time(for(i in 1:10^4) temp <- rowSums(is.na(df)) < 3)  # .078
system.time(for(i in 1:10^4) temp <- apply(df, 1, function(x) any(!is.na(x))))  # 3.33
# larger data frame
x <- matrix(runif(10^5), 10^3)
x[runif(10^5) > .99] <- NA
df2 <- data.frame(x)
system.time(for(i in 1:100) temp <- rowSums(is.na(df2)) < 100)  # .34
system.time(for(i in 1:100) temp <- apply(df2, 1, function(x) any(!is.na(x))))  # 3.34
Tim Hesterberg
Re: [R] Sampling
Thomas Lumley wrote: On Wed, 6 Feb 2008, Tim Hesterberg wrote: Tim Hesterberg wrote: I'll raise a related issue - sampling with unequal probabilities, without replacement. R does the wrong thing, in my opinion: ... Peter Dalgaard wrote: But is that the right thing? ... (See bottom for more of the previous messages.) First, consider the common case, where size * max(prob) <= 1 -- sampling with unequal probabilities without replacement. Why do people do sampling with unequal probabilities, without replacement? A typical application would be sampling with probability proportional to size, or more generally where the desire is that selection probabilities match some criterion. In real survey PPS sampling it also matters what the pairwise joint selection probabilities are -- and there are *many* algorithms, with different properties. Yves Tillé has written an R package that implements some of them, and the pps package implements others. Brewer & Hanif (1983), Sampling with Unequal Probabilities, Springer, give 50 algorithms. The default S-PLUS algorithm does that. The selection probabilities at each of steps 1, 2, ..., size are all equal to prob, and the overall probabilities of selection are size*prob. Umm, no, they aren't. Splus 7.0.3 doesn't say explicitly what its algorithm is, but is happy to take a sample of size 10 from a population of size 10 with unequal sampling probabilities. The overall selection probability *can't* be anything other than 1 for each element -- sampling without replacement and proportional to any other set of probabilities is impossible. The default S-PLUS algorithm when probabilities are supplied is sampling with minimal replacement, not without replacement.
Minimal replacement is not interpreted strictly - duplicates may occur if (size*max(prob) > 1), so that selection probabilities are right at every step:
10 * (1:10 / sum(1:10)) # expected counts
[1] 0.1818182 0.3636364 0.5454545 0.7272727 0.9090909 1.0909091 1.2727273
[8] 1.4545455 1.6363636 1.8181818
sample(10, size=10, prob = 1:10) # default arguments give minimal replacement
[1] 9 9 5 10 3 10 7 6 7 8
One correction to my statement - in the case that size * max(prob) > 1, replace "overall probabilities of selection are size*prob" with "the expected number of times each observation is selected is size*prob". In contrast, minimal=F gives the same algorithm as R, without replacement:
sample(10, size=10, prob = 1:10, minimal = F)
[1] 8 10 9 6 7 3 4 5 1 2
Note that the later draws tend to be observations with small prob, because that is what is left over. Here is a quick simulation to demonstrate selection probabilities:
values <- sapply(1:10^3, function(i) sample(5, size=3, prob = 1:5))
apply(values, 1, table) / 10^3
In S-PLUS, with minimal replacement:
   [,1]  [,2]  [,3]
1 0.068 0.060 0.065
2 0.145 0.140 0.137
3 0.181 0.214 0.197
4 0.274 0.243 0.276
5 0.332 0.343 0.325
In R:
   [,1]  [,2]  [,3]
1 0.051 0.082 0.117
2 0.140 0.159 0.211
3 0.208 0.209 0.230
4 0.269 0.261 0.230
5 0.332 0.289 0.212
Column k corresponds to the kth draw. Note that in S+ all columns match the specified probabilities prob = (1:5/15), while in R only the first column does. Splus 7.0.3 doesn't say explicitly what its algorithm is, ... I'll add it to the help file, and post to r-devel. Even in a milder case -- samples of size 5 from 1:10 with probabilities proportional to 1:10 -- the deviation is noticeable in 1000 replications. In this case sampling with the specified probabilities is actually possible, but S-PLUS doesn't do it. I think you're looking at the case of sampling without replacement, not sampling with minimal replacement.
For sampling without replacement S-PLUS uses the same algorithm as R -- at each step, draw an observation with probabilities proportional to prob, omitting observations already drawn. Now, it might be useful to add another replace=FALSE sampler to sample(), such as the newish Conditional Poisson Sampler based on the work of S.X.Chen. This does give correct marginal probabilities of inclusion, and the pairwise joint probabilities are not too hard to compute. That could be interesting. The pairwise joint probabilities are important in some applications. Are all the pairwise joint probabilities nonzero (when size*prob allows that)? Pedro J. Saavedra (2005) Comparison of Two Weighting Schemes for Sampling with Minimal Replacement http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf finds that the other scheme has some advantages over the one I used. A historical note - Saavedra notes that Chromy (1979) had the concept of sampling with minimal replacement under a different name. I used the term in the S+Resample version released 8/26/2002. Note that any algorithm for sampling without replacement can be extended to sampling with minimal replacement, by first drawing observations using the integer part of (size*prob
Re: [R] R object as a function
This function is a vectorized alternative to integrate:
CumIntegrate <- function(f, a = 0, b = 1, nintervals = 5, ...){
  # Cumulative integral. f is a function, a and b are endpoints.
  # Return list(x,y) of two vectors,
  # where x = seq(a, b, length = nintervals + 1)
  # and y = \int_a^{x} f(t) dt
  #
  # 10-point Gaussian quadrature on each of nintervals subintervals.
  # Assume that f can take a vector argument.
  ends <- seq(a, b, len = nintervals + 1)
  h <- (b - a)/(2 * nintervals)  # half-width of a single subinterval
  mids <- (ends[-1] + ends[1:nintervals])/2
  x <- c(0.14887433898163099, 0.43339539412924699, 0.67940956829902399,
         0.86506336668898498, 0.97390652851717197)
  wt <- c(0.29552422471475298, 0.26926671930999602, 0.21908636251598201,
          0.149451349150581, 0.066671344308687999)
  xvalues <- outer(h * c(-x, x), mids, "+")
  list(x = ends,
       y = c(0, h * cumsum(colSums(matrix(wt * f(c(xvalues), ...), 10)))))
}
Tim Hesterberg
On 22/01/2008 5:30 AM, Thomas Steiner wrote: I want to use a function as an argument to integrate it twice. ... Duncan Murdoch wrote: ... The other problem is that integrate is not vectorized, it can only take scalar values for lower and upper, so you'll need a loop within innerFkn to do the integral over the values being passed in.
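For comparison, the non-vectorized route the thread discusses -- one integrate() call per upper limit, wrapped in a loop -- looks like this (the integrand here is just an example):

```r
# Cumulative integral of f via repeated calls to integrate(), which
# accepts only scalar limits; sapply() supplies the loop.
f <- function(t) exp(-t^2)
xs <- seq(0, 1, length.out = 6)
ys <- sapply(xs, function(u) integrate(f, lower = 0, upper = u)$value)
```

This is simpler but calls the quadrature routine once per point, where CumIntegrate evaluates f on a single vector of nodes.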
Re: [R] Bootstrap Correlation Coefficient with Moving Block Bootstrap
It sounds like you should sample x and y together using the block bootstrap. If you have the usual situation, x and y in columns and observations in rows, then sample blocks of rows. Even though observations in y are independent, you would take advantage of that only for bootstrapping statistics that depend only on y. The answer to your second question is the same as the first - sample blocks of observations, keeping x and y together. Tim Hesterberg
Hello. I have two problems in bootstrapping from dependent data sets. Given two time series x and y, both consisting of n observations, with x consisting of dependent and y consisting of independent observations over time. Also assume that the optimal block length l is given. To obtain my bootstrap sample, I have to draw pairwise, but there is the problem of dependence of the x-observations, so if I draw the third observation of y, I cannot simply draw the third observation of x (to retain the serial correlation structure between x and y), because I divided x into blocks of length l and I have to draw blocks when I draw from x. 1. How can I compute a bootstrap sample of the correlation coefficient between x and y with respect to the dependence in the time series x? 2. How does it look if x and y both consist of dependent observations? I hope you can help me. I got really stuck with this problem. Sincerely, Klein.
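A sketch of one replicate of that scheme, with simulated data standing in for the real series (block length l and series here are illustrative):

```r
# Moving-block bootstrap: sample blocks of row indices so each x block
# keeps its paired y values, then compute the correlation.
set.seed(1)
n <- 100; l <- 5
x <- arima.sim(list(ar = 0.6), n)  # dependent series
y <- rnorm(n)                      # independent series
starts <- sample(1:(n - l + 1), ceiling(n / l), replace = TRUE)
idx <- as.vector(outer(0:(l - 1), starts, "+"))[1:n]  # concatenated blocks
cor(x[idx], y[idx])  # one bootstrap replicate of the correlation
```

Repeating the last three lines many times gives the bootstrap distribution of the correlation coefficient.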
Re: [R] rowSums() and is.integer()
I wrote the original rowSums (in S-PLUS). There, rowSums() does not coerce integer to double. However, one advantage of coercion is to avoid integer overflow. Tim Hesterberg ... So, why does rowSums() coerce to double (behaviour that is undesirable for me)?
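The overflow point is easy to demonstrate: integer arithmetic in R cannot represent sums past .Machine$integer.max, while the double result from rowSums() can.

```r
# Two integers whose sum exceeds .Machine$integer.max (about 2.1e9).
m <- matrix(c(2000000000L, 2000000000L), nrow = 1)
rowSums(m)               # 4e+09, fine as a double
suppressWarnings(sum(m)) # NA: integer sum overflows (with a warning)
```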