I'm trying to do a bake-off between Bernoulli (B) sampling (drop a sample if a random draw > percentage) vs. Reservoir (R) sampling (maintain a box of randomly chosen samples).
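
For concreteness, here is a minimal Python sketch of the two samplers being compared. It is not the code actually used in the bake-off: the function names `bernoulli_sample` and `reservoir_sample`, the `rng` parameter, and the choice of the classic "Algorithm R" for the reservoir are all assumptions made for illustration.

```python
import random

def bernoulli_sample(stream, fraction, rng=random):
    """Keep each item independently with probability `fraction`."""
    # Equivalent to "drop if random > percentage".
    return [x for x in stream if rng.random() < fraction]

def reservoir_sample(stream, k, rng=random):
    """Keep a box of k items chosen uniformly from the stream (Algorithm R)."""
    box = []
    for i, x in enumerate(stream):
        if i < k:
            box.append(x)            # the first k items fill the box
        else:
            j = rng.randint(0, i)    # randint is inclusive on both ends
            if j < k:
                box[j] = x           # item survives with probability k / (i + 1)
    return box

rng = random.Random(42)              # fixed seed so runs are repeatable
data = list(range(1000))
b = bernoulli_sample(data, 0.1, rng) # size varies around 100
r = reservoir_sample(data, 100, rng) # size is exactly 100
```

Note the difference in output size: the Bernoulli sample has a random length, while the reservoir always returns exactly `k` items once the stream is at least that long.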
Here is a test (simplified for explanatory purposes):

* Create a list of 1000 numbers, 0-999. Permute this list.
* Subsample N values.
* Add them and take the median.
* Do this 20 times and record the medians.
* Calculate the standard deviation of the 20 median values.

This last is my score for 'how good is the randomness of this sampler'. Does this make sense? In this measurement, is a small or a large deviation better? What is another way to measure it?

Notes:
Bernoulli pulls X percent of the samples and ignores the rest. Reservoir pulls all of the samples and saves X of them. However, it saves the first N samples and slowly replaces them. This suppresses the deviation for small samples. This realization came just now; I'll cut that phase.

Really I used the OnlineSummarizer and did deviations of mean/median/25 percentile/75 percentile. I had a more detailed report with numbers, but just realized that given the above I have to start over.

Barbie says: "designing experiments is hard!"

--
Lance Norskog
[email protected]
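
The simplified test above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the code that was actually run: "add them and take the median" is read here as "take the median of the subsample", the helper names (`median_spread`, `bernoulli_sample`, `reservoir_sample`) are hypothetical, and the reservoir is the textbook Algorithm R rather than any particular library's implementation.

```python
import random
import statistics

def bernoulli_sample(items, fraction, rng):
    # Keep each item independently with probability `fraction`.
    return [x for x in items if rng.random() < fraction]

def reservoir_sample(items, k, rng):
    # Algorithm R: fill the box with the first k items, then replace at random.
    box = []
    for i, x in enumerate(items):
        if i < k:
            box.append(x)
        else:
            j = rng.randint(0, i)
            if j < k:
                box[j] = x
    return box

def median_spread(sampler, trials=20, population=1000, seed=0):
    """Standard deviation of the sample medians over `trials` runs."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        data = list(range(population))
        rng.shuffle(data)                 # permute the list
        sample = sampler(data, rng)       # subsample
        if sample:                        # a Bernoulli sample can be empty
            medians.append(statistics.median(sample))
    return statistics.stdev(medians)

n = 100
print("Bernoulli:", median_spread(lambda d, rng: bernoulli_sample(d, n / 1000, rng)))
print("Reservoir:", median_spread(lambda d, rng: reservoir_sample(d, n, rng)))
```

The size of the score depends mostly on N, so the sketch compares the two samplers at the same N (exactly N for the reservoir, N on average for Bernoulli).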
