Hi,

I saw that Spark's RDDs allow taking a fixed-size sample with
replacement. If I read the code correctly, it draws a sample larger than
requested, randomly shuffles the sampled data points, and truncates the
result to the number of elements requested.

The sampling itself works by approximating the distribution of the
number of occurrences of each data point in the sample with a Poisson
distribution. Can you point me to a resource describing the
mathematical background of this approach?
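For context, here is a minimal sketch of the scheme as I understand it (this is my own illustration, not Spark's actual code; the 1.2 oversampling slack factor is an arbitrary assumption): each element's occurrence count is drawn from a Poisson distribution with mean slightly above n/N, and the oversampled result is shuffled and truncated to exactly n elements.

```python
import math
import random

def poisson_draw(rng, lam):
    # Knuth's algorithm for sampling Poisson(lam); fine for small lam.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_with_replacement(data, n, seed=42):
    """Sketch of Poisson-approximated fixed-size sampling with replacement."""
    rng = random.Random(seed)
    N = len(data)
    # Oversample: set the per-element expected count a bit above n/N so
    # the total draw rarely falls below n. The factor 1.2 is illustrative.
    lam = (n / N) * 1.2
    while True:
        sample = []
        for x in data:
            # Each element appears Poisson(lam) times in the raw sample.
            sample.extend([x] * poisson_draw(rng, lam))
        if len(sample) >= n:
            # Shuffle, then truncate to exactly the requested size.
            rng.shuffle(sample)
            return sample[:n]

data = list(range(1000))
s = sample_with_replacement(data, 50)
print(len(s))  # 50
```

The retry loop handles the (unlikely) case where the oversampled draw still comes up short of n.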

Best,
Sebastian
