Hi,

I saw that Spark's RDDs allow taking a fixed-size sample with
replacement. If I read the code correctly, it draws a sample larger than
requested, randomly shuffles the sampled data points, and truncates the
result to the number of elements requested.

The sampling itself works by approximating the distribution of the
number of occurrences of each data point in the sample with a Poisson
distribution. Can you point me to a resource describing the
mathematical background of this approach?
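For context, here is a minimal sketch of the scheme as I understand it (this is my own illustration, not Spark's actual code; the 1.2 oversampling slack factor is an arbitrary assumption): each element's occurrence count is drawn from a Poisson distribution with mean slightly above n/N, and the oversampled result is shuffled and truncated to exactly n elements.

```python
import math
import random

def poisson_draw(rng, lam):
    # Knuth's algorithm for sampling Poisson(lam); fine for small lam.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_with_replacement(data, n, seed=42):
    """Sketch of Poisson-approximated fixed-size sampling with replacement."""
    rng = random.Random(seed)
    N = len(data)
    # Oversample: set the per-element expected count a bit above n/N so
    # the total draw rarely falls below n. The factor 1.2 is illustrative.
    lam = (n / N) * 1.2
    while True:
        sample = []
        for x in data:
            # Each element appears Poisson(lam) times in the raw sample.
            sample.extend([x] * poisson_draw(rng, lam))
        if len(sample) >= n:
            # Shuffle, then truncate to exactly the requested size.
            rng.shuffle(sample)
            return sample[:n]

data = list(range(1000))
s = sample_with_replacement(data, 50)
print(len(s))  # 50
```

The retry loop handles the (unlikely) case where the oversampled draw still comes up short of n.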

Best,
Sebastian
