Hello Everyone,

          I am wondering if anyone has run into an issue that I am having
using SAMPLE in a Pig script to create a 0.001% subsample of the
original relation.

Assume the relation "A" contains a single column of data (int type) with
1,000,000 records:

Asamp = SAMPLE A 0.00001;
Asamp2 = SAMPLE A 0.0001;

Asamp and Asamp2 should produce subsampled relations with about 10 and 100
records, respectively. However, what I find is that Asamp and Asamp2 contain
closer to 1,000 and 10,000 records, which looks like a 100-fold error in
sample size. Interestingly, in the limiting case of:

Asamp3 = SAMPLE A 0.99;

the actual sample size is VERY close to the expected 99% of the full
relation. Can anyone shed light on what I may be doing wrong, or share
their experiences if they have also seen issues with using SAMPLE in
Pig? Thank you.
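
For reference, the sizes I expect can be sketched in a few lines of Python.
This is only a sanity check on the arithmetic, not Pig code, and it assumes
SAMPLE keeps each record independently with probability p (a binomial
model, which is my assumption about its behavior). The function name
expected_sample_size is made up for illustration.

```python
import math

def expected_sample_size(n_records, p):
    """Return (mean, std dev) of the sample size, ASSUMING each record is
    kept independently with probability p (binomial model)."""
    mean = n_records * p
    std = math.sqrt(n_records * p * (1.0 - p))
    return mean, std

N = 1_000_000  # size of relation A, as stated above

# The three fractions used in the SAMPLE statements above.
for p in (0.00001, 0.0001, 0.99):
    mean, std = expected_sample_size(N, p)
    print(f"p={p}: expect about {mean:.0f} records (+/- {std:.1f})")
```

Under that assumption, random fluctuation at the two small fractions is
only on the order of 3 and 10 records, so chance alone would not account
for a 100-fold difference.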

           Brian
