Re: [R] how to generate a random data from a empirical distribition
Hi Frank,
How can we make sure the randomly sampled data follow the same distribution as the original dataset? I assume each data point has the same probability of being selected in a simple random sampling scheme.
Thanks.
--
View this message in context: http://r.789695.n4.nabble.com/how-to-generate-a-random-data-from-a-empirical-distribition-tp2302716p2305275.html
Sent from the R help mailing list archive at Nabble.com.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to generate a random data from a empirical distribition
This is true by definition. Read about the bootstrap, which may give you some good background information.
Frank E Harrell Jr
Professor and Chairman, School of Medicine
Department of Biostatistics, Vanderbilt University
On Wed, 28 Jul 2010, xin wei wrote:
> how can we make sure the randomly sampled data follow the same distribution as the original dataset? [...]
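To make the "true by definition" point concrete, here is a minimal sketch in R. The data are simulated stand-ins (the gamma parameters, the seed, and the names olddata and newdata are assumptions for illustration, not from the thread). With its default prob argument, sample() gives every observation the same selection probability, so the resample is a draw from the empirical distribution of the original data, and its ECDF and quantiles should track the original closely.

```r
set.seed(42)

# Simulated stand-in for the poster's 10,000 observations (assumption).
olddata <- rgamma(10000, shape = 2, rate = 0.5)

# Equal-probability sampling with replacement: a draw from the empirical distribution.
newdata <- sample(olddata, size = 20000, replace = TRUE)

# Every resampled value is one of the original observations.
all_in <- all(newdata %in% olddata)

# Largest vertical gap between the two ECDFs, evaluated at the observed values.
grid <- sort(unique(olddata))
max_gap <- max(abs(ecdf(olddata)(grid) - ecdf(newdata)(grid)))

# Side-by-side quantiles of the original data and the resample.
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
print(rbind(original = quantile(olddata, probs),
            resampled = quantile(newdata, probs)))
print(max_gap)
```

The gap between the two ECDFs shrinks as the resample size grows; it reflects sampling noise only, not a difference in the underlying distribution.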
Re: [R] how to generate a random data from a empirical distribition
David Winsemius wrote:
> On Jul 26, 2010, at 2:36 PM, xin wei wrote:
>> hi, this is more a statistical question than an R question, but I do want to know how to implement this in R. I have 10,000 data points. Is there any way to generate an empirical probability distribution from them? (The problem is that I do not know what exactly this distribution follows: normal, beta?)
> ?ecdf
I'd say ?sample, for sampling with replacement. The inverse ECDF method is not likely to be efficient unless you want a smoothed version of the distribution function, and ecdf() doesn't help you there.
>> My ultimate goal is to generate an additional 20,000 data points from this empirical distribution created from the existing 10,000 data points. Thank you all in advance.
--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd@cbs.dk Priv: pda...@gmail.com
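For readers unfamiliar with the inverse ECDF method mentioned above, the following is a rough sketch of it in R next to the direct sample() call. The simulated data and all variable names are assumptions for illustration. The idea is to draw a uniform number u and return the smallest observation whose ECDF value is at least u, which for n distinct sorted values is the order statistic at index ceiling(u * n).

```r
set.seed(123)

# Simulated stand-in for the existing data (assumption).
olddata <- rnorm(10000, mean = 5, sd = 2)
n <- length(olddata)

# Inverse ECDF method: sort once, then map uniforms to order statistics.
sorted <- sort(olddata)
u <- runif(20000)
new_inverse <- sorted[ceiling(u * n)]

# Direct method: same distribution, no sorting needed.
new_direct <- sample(olddata, size = 20000, replace = TRUE)

# Both approaches only ever return values already present in the data.
print(summary(new_inverse))
print(summary(new_direct))
```

Both vectors are draws from the same step-function distribution; the direct call is simply shorter and skips the sort.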
Re: [R] how to generate a random data from a empirical distribition
Hi:
On Mon, Jul 26, 2010 at 11:36 AM, xin wei xin...@stat.psu.edu wrote:
> hi, this is more a statistical question than an R question, but I do want to know how to implement this in R. I have 10,000 data points. Is there any way to generate an empirical probability distribution from them? (The problem is that I do not know what exactly this distribution follows: normal, beta?) My ultimate goal is to generate an additional 20,000 data points from this empirical distribution created from the existing 10,000 data points. Thank you all in advance.
The problem, it seems to me, is the leap of faith you're taking that the empirical distribution of your manifest sample will serve as a useful data-generating mechanism for the 20,000 future observations you want to take. I would think that, if you intend to take a sample of 20,000 from ANY distribution, you would want some confidence in the specification of said distribution.
Even if you don't know exactly what type of population distribution you're dealing with, there are ways to narrow down the set of possibilities. What is the domain/support of the distribution? For example, the Normal is defined on all of R (as in the real numbers, not our favorite statistical programming language), whereas the lognormal, Gamma and Weibull distributions, among others, are defined on the nonnegative reals. The beta distribution is defined on [0, 1]. Therefore, knowledge of the domain is useful in and of itself. Is it plausible that the distribution is symmetric, or should it have a distinct left or right skew? (Similar comments apply to discrete distributions.) Is censoring or truncation a relevant concern? If there is a random process that well describes how the data you observe are generated, that will certainly narrow down the class of potential data-generating mechanisms/distributions.
Once you've narrowed down the class of possible distributions as much as possible, you could look into the fitdistr() function in MASS or the fitdistrplus package on CRAN to test out which candidates seem plausible with respect to your existing sample and which are not. You are not likely to be able to narrow it down to one family of distributions, but after going through this exercise you should have a much better idea about the characteristics of the distribution that gave rise to your sample of 10,000 (assuming, of course, that it is a *random* sample), which you can apply to the generation of the next 20,000 observations. OTOH, if your existing 10,000 observations were not produced by some random process, all bets are off.
HTH,
Dennis
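As an illustration of the fitdistr() suggestion, here is a minimal sketch that fits two candidate families and compares them by AIC. Everything specific in it is an assumption: the data are simulated from a lognormal, only two candidates are tried, and AIC is just one possible comparison criterion. It relies on the MASS package, which ships with standard R installations.

```r
set.seed(1)

# Simulated stand-in for the existing sample (assumption: positive, right-skewed data).
olddata <- rlnorm(10000, meanlog = 1, sdlog = 0.5)

# Fit two candidate families by maximum likelihood.
fit_norm  <- MASS::fitdistr(olddata, "normal")
fit_lnorm <- MASS::fitdistr(olddata, "lognormal")

# Lower AIC indicates the better-supported candidate.
aic <- c(normal = AIC(fit_norm), lognormal = AIC(fit_lnorm))
print(aic)
best <- names(which.min(aic))

# Generate new observations from the fitted lognormal.
newdata <- rlnorm(20000,
                  meanlog = fit_lnorm$estimate[["meanlog"]],
                  sdlog   = fit_lnorm$estimate[["sdlog"]])
```

A lower AIC says only that one candidate fits better than the other, not that either is adequate, so a visual check such as a Q-Q plot is still worthwhile.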
Re: [R] how to generate a random data from a empirical distribition
Hi Dennis,
you should take a look at the CRAN task view for distributions:
http://cran.r-project.org/web/views/Distributions.html
Besides that, our distr family of packages might be useful; see also
http://www.jstatsoft.org/v35/i10/
http://cran.r-project.org/web/packages/distrDoc/vignettes/distr.pdf
Best,
Matthias
On 27.07.2010 10:37, Dennis Murphy wrote:
> Hi: [...]
--
Dr. Matthias Kohl
www.stamats.de
Re: [R] how to generate a random data from a empirical distribition
-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of xin wei
Sent: Monday, July 26, 2010 11:36 AM
To: r-help@r-project.org
Subject: [R] how to generate a random data from a empirical distribition
> hi, this is more a statistical question than an R question. [...]
Without knowing more than what you have stated in your email, I can only suggest that you look at
  ?sample
You may be able to do something as simple as
  newdata <- olddata[sample(1:10000, size=20000, replace=TRUE)]
If you need more help, you need to tell us more about your data and what you are trying to do.
Hope this is helpful,
Dan
Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
Re: [R] how to generate a random data from a empirical, distribition
On 7/27/2010 6:00 AM, r-help-requ...@r-project.org wrote:
> Date: Mon, 26 Jul 2010 11:36:29 -0700 (PDT)
> From: xin wei xin...@stat.psu.edu
> To: r-help@r-project.org
> Subject: [R] how to generate a random data from a empirical distribition
> hi, this is more a statistical question than an R question. [...]
Ah! This brings back memories of the halcyon days of my youth when, as a junior in college, I took a course in introductory probability theory around this time during the summer in preparation for working as a co-op student the coming fall.
Conceptually, why not treat your empirical sample as an urn with 10,000 items? Then take a sample of 20,000 by sampling with equal probabilities and replacement (otherwise you'll run out of cases before 20,000).
Remember that all the common distributions (normal, etc.) either were derived because they fit certain common situations (e.g., binomial), are of particular use (e.g., Student's t), can be derived from other distributions (e.g., normal and the Central Limit Theorem), or some combination of such things. In other words, whether or not an empirical sample fits one of them is always contingent, although understanding any underlying processes that generate the sample might point in the direction of certain distributions over others. Nonetheless, for something like a Monte Carlo simulation, knowledge of an underlying distribution is not necessary.
Also remember that many things in statistics were developed largely because they made certain problems mathematically tractable. (Hence, for example, the large number of situations involving independent, identically distributed random samples or the popularity of ordinary least-squares regression.) Today, most of us have more computing power at our desks than entire mainframe computing centers had a few decades ago. So in many instances, we don't need no stinkin' complex formulas anymore.
If you suspect the distribution corresponds to one of the mathematically studied distributions, why not fit a curve to a plot of your data points and see if it looks familiar? Then do some kind of goodness-of-fit test to see if the theoretical distribution is a reasonable approximation.
--
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research http://www.uri.edu/prov/research/urbanstudies.html
The University of Rhode Island http://www.uri.edu
email: marsh @ uri .edu (remove spaces)
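The last suggestion (overlay a fitted curve, then run a goodness-of-fit test) might look roughly like this in R. The simulated data, the two candidate distributions, and the choice of the Kolmogorov-Smirnov test are all assumptions for illustration; the message does not name a specific test. Note that when the parameters are estimated from the same data, the reported KS p-value is too optimistic, so treat it as a rough screen rather than a formal test.

```r
set.seed(7)

# Simulated stand-in for the existing sample (assumption).
olddata <- rnorm(10000, mean = 50, sd = 10)
m <- mean(olddata)
s <- sd(olddata)

# Visual check: histogram with the fitted normal density on top.
if (interactive()) {
  hist(olddata, breaks = 50, freq = FALSE, main = "Data with fitted normal")
  curve(dnorm(x, mean = m, sd = s), add = TRUE)
}

# Kolmogorov-Smirnov distance to a plausible and an implausible candidate.
ks_normal <- ks.test(olddata, "pnorm", mean = m, sd = s)
ks_expo   <- ks.test(olddata, "pexp", rate = 1 / m)

print(c(normal = unname(ks_normal$statistic),
        exponential = unname(ks_expo$statistic)))
```

The statistic is the largest gap between the ECDF and the candidate CDF, so the poorly matching candidate shows a much larger value.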
Re: [R] how to generate a random data from a empirical distribition
Another option for fitting a smooth distribution to data (and generating future observations from the smooth distribution) is to use the logspline package.
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111
-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of xin wei
Sent: Monday, July 26, 2010 12:36 PM
To: r-help@r-project.org
Subject: [R] how to generate a random data from a empirical distribition
> hi, this is more a statistical question than an R question. [...]
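A hedged sketch of the logspline route follows. It assumes the logspline package from CRAN is installed and uses its logspline() and rlogspline() functions; the simulated data and variable names are illustrative only. The block is guarded so that it still runs, without producing a sample, when the package is missing.

```r
set.seed(1)

# Simulated stand-in for the existing sample (assumption).
olddata <- rgamma(10000, shape = 2, rate = 0.5)

if (requireNamespace("logspline", quietly = TRUE)) {
  # Fit a smooth density to the data, then simulate from the fit.
  fit <- logspline::logspline(olddata)
  newdata <- logspline::rlogspline(20000, fit)
  print(summary(newdata))
} else {
  message("Package 'logspline' is not installed; skipping the smooth fit.")
  newdata <- NULL
}
```

Because no bounds are passed to logspline() here, simulated values can fall outside the observed range; the lbound and ubound arguments restrict the support if that matters.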
Re: [R] how to generate a random data from a empirical distribition
Easiest thing is to sample with replacement from the original data. This is the idea behind the bootstrap, which is sampling from the empirical CDF.
Frank E Harrell Jr
Professor and Chairman, School of Medicine
Department of Biostatistics, Vanderbilt University
On Tue, 27 Jul 2010, Greg Snow wrote:
> Another option for fitting a smooth distribution to data (and generating future observations from the smooth distribution) is to use the logspline package. [...]
Re: [R] how to generate a random data from a empirical distribition
If they want to generate directly from the empirical distribution, then sampling with replacement is the best choice (others had already suggested that). But the reference in the original post to the normal and beta distributions suggested to me that the original poster may have wanted a smooth approximation to the empirical distribution rather than the step function (but not locked to a specific distribution).
The logspline package has functions for doing things like this. It has the advantage that it can give a smooth (non-step) plot of the estimated CDF, and it can generate points that are based on the observed data but that may fall outside the original range of the data and have fewer ties. Whether these advantages make any difference depends on what they want to do with the observations (for many applications the difference is probably negligible and using sample is the simplest/best). But there may be some uses for which these advantages are beneficial. (Using sample and then adding a small random error to each value is another option, but I like the logspline option better.)
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111
-----Original Message-----
From: Frank Harrell [mailto:f.harr...@vanderbilt.edu]
Sent: Tuesday, July 27, 2010 4:54 PM
To: Greg Snow
Cc: xin wei; r-help@r-project.org
Subject: Re: [R] how to generate a random data from a empirical distribition
> Easiest thing is to sample with replacement from the original data. [...]
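The parenthetical option above (sample, then add a small random error) can be sketched in base R as follows. The size of the error is not specified in the message; using Gaussian noise with standard deviation bw.nrd0(olddata) is an assumption made here, chosen because it makes the procedure equivalent to drawing from a Gaussian kernel density estimate with that bandwidth. The data and names are simulated and illustrative.

```r
set.seed(3)

# Simulated stand-in for the existing sample (assumption).
olddata <- rgamma(10000, shape = 2, rate = 0.5)

# Noise scale: the default bandwidth rule used by density() (assumed choice).
h <- bw.nrd0(olddata)

# Plain resample versus resample plus small Gaussian error.
plain    <- sample(olddata, size = 20000, replace = TRUE)
smoothed <- sample(olddata, size = 20000, replace = TRUE) +
  rnorm(20000, mean = 0, sd = h)

# The plain resample must repeat values; the jittered one almost never does.
n_ties_plain  <- sum(duplicated(plain))
n_ties_smooth <- sum(duplicated(smoothed))
print(c(plain = n_ties_plain, smoothed = n_ties_smooth))

# The jittered values can also fall outside the observed range.
print(range(olddata))
print(range(smoothed))
```

One visible side effect for data with a natural boundary (here, zero) is that the added noise can push values across it, which a bounded smooth fit would avoid.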
Re: [R] how to generate a random data from a empirical distribition
Dennis: points well taken. It seems to be important to investigate the nature of the distribution. I might be too naive in assuming that an empirical probability distribution can simply be calculated from a cloud of data points...
Re: [R] how to generate a random data from a empirical distribition
Hi Dennis: points well taken. It seems to be important to investigate the nature of the distribution. I may be too naive in assuming that an empirical probability distribution would be computed from a cloud of data points.
Re: [R] how to generate a random data from a empirical distribition
Good point. It seems to be important to investigate the nature of the distribution. I might be too naive in assuming that an empirical probability distribution would be automatically generated from a cloud of data points.
Re: [R] how to generate a random data from a empirical distribition
This is very insightful. It sounds exactly like what I want to do. Thanks, Frank.
Re: [R] how to generate a random data from a empirical distribition
On Jul 26, 2010, at 2:36 PM, xin wei wrote:
> hi, this is more a statistical question than an R question, but I do want to know how to implement this in R. I have 10,000 data points. Is there any way to generate an empirical probability distribution from them? (The problem is that I do not know what exactly this distribution follows: normal, beta?)
?ecdf
> My ultimate goal is to generate an additional 20,000 data points from this empirical distribution created from the existing 10,000 data points. Thank you all in advance.
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
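For anyone following the ?ecdf pointer, a short sketch of what that help page covers is given below. The simulated data and the evaluation point 0.3 are assumptions for illustration. ecdf() returns a step function that can be evaluated at any value, plotted, or passed to quantile().

```r
set.seed(11)

# Simulated stand-in for the existing sample (assumption).
olddata <- rbeta(10000, shape1 = 2, shape2 = 5)

# Build the empirical cumulative distribution function.
Fn <- ecdf(olddata)

# Evaluate it: the proportion of observations less than or equal to 0.3.
p_at_03 <- Fn(0.3)
print(p_at_03)

# Empirical quartiles taken from the same object.
print(quantile(Fn, c(0.25, 0.5, 0.75)))

# Step-function plot of the ECDF.
if (interactive()) plot(Fn, main = "Empirical CDF")
```

The object describes the empirical distribution but does not generate new values by itself; drawing from it is what sample() with replace = TRUE does.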