Hi Sebastian,

I believe the reasoning was as follows. The number of times a given element 
occurs when sampling with replacement follows a binomial distribution 
(http://en.wikipedia.org/wiki/Binomial_distribution), but for rare events 
(i.e., when each individual element is unlikely to be drawn) this can be 
approximated by a Poisson distribution: 
http://en.wikipedia.org/wiki/Poisson_limit_theorem.
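To make that concrete, here is a small sketch (my own illustration, not Spark 
code; the dataset and sample sizes are made up): when you draw m samples with 
replacement from a dataset of N elements, the count of any one element is 
Binomial(m, 1/N), which for large N is close to Poisson(m/N).

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return math.exp(-lam) * lam**k / math.factorial(k)

N = 1_000_000   # dataset size (assumed, for illustration)
m = 100_000     # number of draws with replacement
lam = m / N     # expected occurrences of one element = 0.1

# The two pmfs agree to several decimal places for small counts
for k in range(4):
    print(k, round(binom_pmf(k, m, 1 / N), 6), round(poisson_pmf(k, lam), 6))
```

The approximation is good precisely because p = 1/N is tiny and m is large, 
which is the "large dataset, small sampling fraction" regime mentioned below.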

It's true though that this is just an approximation, and it only holds for 
large datasets and small sampling fractions. We might want to do something 
different for small datasets, or at least explain in the takeSample docs that 
this is an approximation.

For the first part (can you take a big sample and then truncate it), I think 
the approach is fine -- we are just taking a uniform sample out of the larger 
sample we pulled out with replacement. For example, if you sample 2n elements 
from a distribution with replacement, and then keep half of those uniformly at 
random, the result has the same distribution as taking n with replacement from 
the start. The "with replacement" part in particular means that the underlying 
chance of drawing each element doesn't change as you take more.
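A quick simulation can check this claim empirically. This is my own sketch 
(not Spark's implementation); it compares the per-element frequencies from 
drawing n directly with replacement versus drawing 2n and keeping a random 
half:

```python
import random
from collections import Counter

random.seed(42)
data = list(range(10))   # toy dataset (assumed, for illustration)
n = 5
trials = 50_000

direct = Counter()       # draw n with replacement directly
truncated = Counter()    # draw 2n with replacement, keep a random half
for _ in range(trials):
    direct.update(random.choices(data, k=n))
    big = random.choices(data, k=2 * n)
    truncated.update(random.sample(big, n))

# Each element should appear ~ n/len(data) = 0.5 times per trial either way
for x in data:
    f_direct = direct[x] / trials
    f_trunc = truncated[x] / trials
    assert abs(f_direct - f_trunc) < 0.02
```

Both strategies give each element the same expected frequency, which is what 
the "truncate the oversample" approach in takeSample relies on.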

Matei

On Sep 27, 2013, at 7:38 AM, Sebastian Schelter <[email protected]> wrote:

> Hi,
> 
> I saw that Spark's RDD's allow taking a fixed size sample with
> replacement. If I read the code correctly, it takes a sample larger than
> asked for, randomly shuffles the sampled datapoints and truncates the
> sample to the number of elements requested.
> 
> The sampling itself is done by approximating the distribution of the
> number of occurrences of each datapoint in the sample with a Poisson
> distribution. Can you point me to some resource describing the
> mathematical background of this approach?
> 
> Best,
> Sebastian
> 
