Hi, I saw that Spark's RDDs allow taking a fixed-size sample with replacement. If I read the code correctly, it takes a sample larger than the requested size, randomly shuffles the sampled datapoints, and truncates the sample to the number of elements requested.
The sampling itself is done by approximating the distribution of the number of occurrences of each datapoint in the sample with a Poisson distribution. Can you point me to some resource describing the mathematical background of this approach? Best, Sebastian
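(To make the question concrete, here is a minimal Python sketch of the scheme as I understand it. This is my own illustration, not Spark's code: in a sample of size n drawn with replacement from N elements, each element's count is Binomial(n, 1/N), which is approximately Poisson(n/N) for large N. Drawing independent Poisson counts with a slightly inflated rate makes the total at least n with high probability, after which one shuffles and truncates. The `oversample` factor and the Knuth-style `poisson_draw` helper are my own choices for the sketch.)

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's algorithm for a Poisson(lam) draw (illustrative helper)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_with_replacement(data, n, oversample=3.0, seed=None):
    """Fixed-size sample with replacement via Poisson approximation.

    Each element's count in the sample is Binomial(n, 1/N) ~ Poisson(n/N);
    we inflate the rate by `oversample` so the total count is >= n with
    high probability, then shuffle and truncate to exactly n elements.
    """
    rng = random.Random(seed)
    rate = oversample * n / len(data)  # inflated per-element expected count
    sample = []
    for x in data:
        sample.extend([x] * poisson_draw(rng, rate))
    if len(sample) < n:
        # Rare with a generous oversample factor; a real implementation
        # would retry with a larger rate instead of failing.
        raise ValueError("undersampled; retry with a larger oversample factor")
    rng.shuffle(sample)
    return sample[:n]
```

Whether this matches what Spark actually does, and how the oversampling factor should be chosen so the failure probability is bounded, is exactly what I'd like to read up on.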
