Perhaps I'm misunderstanding your question, but RDD.sample() just uses the fraction as the probability of accepting each tuple (rather than, say, taking every 7th tuple). So on average, 1/7 of the tuples will be returned. For small input sizes, though, it can return significantly more or fewer than 1/7 of the tuples simply due to chance, so getting 2 items back out of 7 is sampling variation rather than a loss of floating-point precision.
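
To make the "due to chance" point concrete, here is a rough sketch in plain Python (not Spark, and not Spark's actual implementation) that assumes each item is kept independently with probability equal to the fraction. The names bernoulli_sample, binomial_pmf and simulate_sizes are made up for this illustration.

```python
import math
import random


def bernoulli_sample(items, fraction, rng):
    # Assumed model: keep each item independently with probability `fraction`.
    return [x for x in items if rng.random() < fraction]


def binomial_pmf(n, p, k):
    # Exact probability that exactly k of n items are kept under that model.
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def simulate_sizes(n, fraction, trials, seed=0):
    # Repeat the sampling many times and count how often each sample size occurs.
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        counts[len(bernoulli_sample(range(n), fraction, rng))] += 1
    return counts


n, fraction, trials = 7, 1 / 7, 20000
counts = simulate_sizes(n, fraction, trials)
for k in range(n + 1):
    # Compare the simulated frequency of each size with the exact probability.
    print(k, counts[k] / trials, round(binomial_pmf(n, fraction, k), 4))
```

Under this model, for n = 7 and fraction 1/7 the exact probabilities are roughly 34% for 0 items, 40% for 1 item and 20% for 2 items, so a sample of size 2 is not unusual.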

On Mon, Oct 21, 2013 at 12:01 PM, Matt Cheah <[email protected]> wrote:
> Hi everyone,
>
> I have a simple RDD of n items. The use case is to get a random sample
> of exactly k items from this RDD. n and k may or may not be very large.
>
> So right now for n = 7, k = 1, I have a unit test running locally, that
> passes the fraction 1 / 7 to RDD.sample(). The double representation as
> printed by Eclipse is 0.14285714285714285. The resulting RDD ends up
> getting 2 items back instead of 1.
>
> Is it expected to get that much error in precision? I'd rather not use
> the takeSample() function which would materialize the whole sample in the
> driver's memory.
>
> Thanks,
>
> -Matt Cheah
