Ah, I misunderstood the functionality then – I was under the impression that 
exactly that fraction would be returned.

Thanks,

-Matt Cheah

From: Aaron Davidson <[email protected]>
Reply-To: [email protected]
Date: Monday, October 21, 2013 12:18 PM
To: [email protected]
Subject: Re: RDD sample fraction precision

Perhaps I'm misunderstanding your question, but RDD.sample() just uses the 
fraction as the probability of accepting a given tuple (rather than, say, 
taking every 7th tuple). So on average, 1/7 of the tuples will be returned. For 
small input sizes, though, this could return significantly more or less than 
1/7 of the tuples simply due to chance.
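
To make the point above concrete, here is a minimal sketch in plain Python (not Spark code, and not Spark's actual implementation) of the behaviour described: each item is kept independently with probability equal to the fraction. The function names bernoulli_sample and prob_sample_size are made up for illustration. Under that assumption the sample size follows a binomial distribution, so its exact probabilities can be computed and compared against a simulation.

```python
import random
from math import comb


def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`.

    This models the behaviour described in the thread; it is an
    illustrative assumption, not Spark's real sampler.
    """
    return [x for x in items if rng.random() < fraction]


def prob_sample_size(n, fraction, k):
    """Binomial probability that exactly k of n items are kept."""
    return comb(n, k) * fraction ** k * (1 - fraction) ** (n - k)


if __name__ == "__main__":
    n, fraction = 7, 1 / 7
    rng = random.Random(42)  # fixed seed so repeated runs match

    # Draw many samples and record how many items each one contains.
    sizes = [len(bernoulli_sample(range(n), fraction, rng))
             for _ in range(10000)]

    # Columns: sample size, exact probability, observed frequency.
    for k in range(4):
        observed = sizes.count(k) / len(sizes)
        print(k, round(prob_sample_size(n, fraction, k), 3), round(observed, 3))
```

Under this model, with n = 7 and fraction = 1/7, the chance of getting exactly one item is (6/7)^6, roughly 40%, and the chance of getting exactly two is roughly 20%. A result of 2 items is therefore an ordinary outcome of the random draw rather than a floating-point precision problem.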

On Mon, Oct 21, 2013 at 12:01 PM, Matt Cheah <[email protected]> wrote:
Hi everyone,

I have a simple RDD of n items. The use case is to get a random sample of 
exactly k items from this RDD. n and k may or may not be very large.

Right now, for n = 7 and k = 1, I have a unit test running locally that passes 
the fraction 1 / 7 to RDD.sample(). The double representation as printed by 
Eclipse is 0.14285714285714285. The resulting RDD ends up with 2 items instead 
of 1.

Is that much error in precision expected? I'd rather not use the 
takeSample() function, which would materialize the whole sample in the driver's 
memory.

Thanks,

-Matt Cheah
