Issues with SAMPLE in PIG v0.8.1

Brian Choi Tue, 11 Sep 2012 15:37:51 -0700

Hello Everyone,

I am wondering if anyone has run into an issue that I am having using SAMPLE
in a Pig script to create a 0.001% subsample of the original relation.


Assume the relation "A" contains a single column of data (int type) with
1,000,000 records:

Asamp = SAMPLE A 0.00001;
Asamp2 = SAMPLE A 0.0001;

Asamp and Asamp2 should produce subsampled relations with roughly 10 and 100
records, respectively. However, what I find is that Asamp and Asamp2 contain
closer to 1,000 and 10,000 records, which looks like a 100-fold error in sample
size. Interestingly, in the limiting case of:

Asamp3 = SAMPLE A 0.99;

The actual subsampled size is very close to the expected 99% of the full
relation. Can anyone shed light on what I may be doing wrong, or share their
experiences if they have also seen issues with using SAMPLE in Pig? Thank you.
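
For reference, the sample sizes can be checked with a script along these
lines. This is only a minimal sketch, not the exact script: the input path
'A.txt' and the *_grp / *_cnt aliases are placeholders for illustration.

-- 'A.txt' is a placeholder path: one int per line, 1,000,000 lines
A = LOAD 'A.txt' AS (x:int);

-- expected sizes: 1,000,000 * 0.00001 = 10 and 1,000,000 * 0.0001 = 100
Asamp  = SAMPLE A 0.00001;
Asamp2 = SAMPLE A 0.0001;

-- count the records in each sample (alias names are placeholders)
Asamp_grp  = GROUP Asamp ALL;
Asamp_cnt  = FOREACH Asamp_grp GENERATE COUNT(Asamp);
Asamp2_grp = GROUP Asamp2 ALL;
Asamp2_cnt = FOREACH Asamp2_grp GENERATE COUNT(Asamp2);

DUMP Asamp_cnt;
DUMP Asamp2_cnt;

Since SAMPLE is probabilistic, the counts will vary somewhat from run to run,
but they should stay near the expected sizes rather than 100 times larger.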

           Brian
