I am puzzled about this. If I am not mistaken, the SAMPLE operator is nothing but "Math.random() < x" where "x" is a double.
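To make that assumption concrete, here is a minimal Python sketch of per-record sampling in the style of "Math.random() < x". It is not Pig's actual source; the function name `sample`, the `seed` parameter, and the use of `range(1_000_000)` as a stand-in for relation A are all hypothetical. If SAMPLE really works this way, the output size should be close to x times the input size, with some random variation.

```python
import random


def sample(records, p, seed=None):
    """Keep each record independently with probability p.

    Mirrors the assumed "Math.random() < x" logic; not Pig's real code.
    """
    rng = random.Random(seed)  # seeded only so that runs are repeatable
    return [r for r in records if rng.random() < p]


# Hypothetical stand-in for relation A: 1,000,000 int records.
A = range(1_000_000)

asamp = sample(A, 0.00001, seed=42)   # expected size around 10
asamp2 = sample(A, 0.0001, seed=42)   # expected size around 100

print(len(asamp), len(asamp2))
```

The counts vary from run to run around the expected values, but under this model a result of 1000 or 10000 records would be essentially impossible.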
In my test, SAMPLE A 0.00001 returns about 10 records from a million records when running in local mode. I am curious whether something can go wrong when running it in MapReduce (MR) mode.

On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
> Hello Everyone,
>
> I am wondering if anyone has run into an issue that I am having
> using SAMPLE in a pig script to create a subsample of 0.001% from the
> original relation.
>
> Assume the relation "A" contains a single column of data (int type) with
> 1,000,000 records
>
> Asamp = SAMPLE A 0.00001;
> Asamp2 = SAMPLE A 0.0001;
>
> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> records, respectively. However, what I find is Asamp and Asamp2 are closer
> to 1000 and 10000 records, which seems like a 100-fold error in sample
> size. Interestingly, in the limiting case of:
>
> Asamp3 = SAMPLE A 0.99;
>
> The actual subsampled size is VERY close to the expected 99% size of the
> full sample size. Can anyone shed light as to what I may be doing wrong or
> share their experiences if they have also seen issues with using SAMPLE in
> PIG. Thank you.
>
> Brian
