In both cases, every element is picked with probability N/1000. That is the purest sense in which these processes can be wrong or right, to me, and they are both exactly as good as the underlying pseudo-random number generator. The difference is not their quality, but the number of elements that are chosen.
I am not sure what the distribution the median of the N values should follow in theory. I doubt it's Gaussian. But that would be your question then -- how likely is it that the 20 observed values are generated by this distribution? This test would not prove all aspects of the sampler work. For example, a sampler that never picked 0 or 999 would have the same result (well, if N>2) as this one, when clearly it has a problem. But I think this is probably a more complicated question than you need ask in practice: what is the phenomenon you are worried will happen or not happen here? On Wed, Jun 1, 2011 at 8:31 AM, Lance Norskog <[email protected]> wrote: > I'm trying to do a bake-off between Bernoulli (B) sampling (drop if > random > percentage) v.s. Reservoir (R) sampling (maintain a box of > randomly chosen samples). > > Here is a test (simplified for explanatory purposes): > * Create a list of 1000 numbers, 0-999. Permute this list. > * Subsample N values > * Add them and take the median > * Do this 20 times and record the medians > * Calculate the standard deviation of the 20 median values > This last is my score for 'how good is the randomness of this sampler'. > > Does this make sense? In this measurement is small or large deviation > better? What is another way to measure it? > > Notes: Bernoulli pulls X percent of the samples and ignores the rest. > Reservoir pulls all of the samples and saves X of them. However, it > saves the first N samples and slowly replaces them. This suppresses > the deviation for small samples. This realization came just now; I'll > cut that phase. > Really I used the OnlineSummarizer and did deviations of > mean/median/25 percentile/75 percentile. > > I had a more detailed report with numbers, but just realized that > given the above I have to start over. > > Barbie says: "designing experiments is hard!" > > > -- > Lance Norskog > [email protected] >
