I have used the SAMPLE operator while implementing the CUBE operator, choosing the sample percentage at runtime so that it always emits around 100K tuples. I tested it on inputs from 1M to 100M tuples and it worked as expected. That was with the trunk version; I haven't tested earlier versions.
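
For illustration only, here is a minimal Python sketch of that idea. It is not the actual CUBE code: the function names and the 100K default are assumptions made up for this example, and the filter simply mimics the "Math.random() < x" condition mentioned in the quoted thread below. It picks the fraction from the input size and then counts how many tuples an independent random filter would keep.

```python
import random


def sample_fraction(total_tuples, target_tuples=100_000):
    """Fraction to hand to SAMPLE so that roughly target_tuples survive.

    Hypothetical helper: capped at 1.0 when the input is smaller than the target.
    """
    if total_tuples <= 0:
        return 0.0
    return min(1.0, target_tuples / total_tuples)


def bernoulli_sample_count(total_tuples, fraction, rng):
    """Count tuples kept by an independent 'random() < fraction' filter."""
    return sum(1 for _ in range(total_tuples) if rng.random() < fraction)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs are comparable
    for n in (1_000_000, 10_000_000):
        p = sample_fraction(n)
        kept = bernoulli_sample_count(n, p, rng)
        print(f"input={n} fraction={p} kept={kept}")
```

With a filter like this the kept count is binomial, so it should land near n * fraction with a standard deviation of about sqrt(n * fraction * (1 - fraction)). For SAMPLE A 0.00001 on 1,000,000 records that is about 10 give or take 3, so a count near 1000 would point to something other than random variation.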
Thanks
-- Prasanth

On Sep 16, 2012, at 11:15 PM, Brian Choi <[email protected]> wrote:

> Yes - i saw this issue with SAMPLE() in multiple runs. The strangest thing
> about this is that it approaches the correct values for SAMPLE() as you
> approach a sample size of 100% (or 0.99), but gets worse as you start
> getting to lower sample fractions.
>
> Brian
>
>
> On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[email protected]> wrote:
>
>> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
>>
>>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
>>> nothing but "Math.random() < x" where "x" is a double.
>>
>> You are right. Sample operator translates in to a filter operator with
>> condition "Math.random() < x".
>>
>>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
>>> records when running in local mode. I am curious if something can go wrong
>>> when running it in MR mode.
>>
>> I wouldn't expect different behavior in case of MR mode.
>>
>> Brian,
>> Do you see this behavior across multiple runs ?
>>
>> -Thejas
>>
>>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> I am wondering if anyone has run into an issue that I am having
>>>> using SAMPLE in a pig script to create a subsample of 0.001% from the
>>>> original relation.
>>>>
>>>> Assume the relation "A" contains a single column of data (int type) with
>>>> 1,000,000 records
>>>>
>>>> Asamp = SAMPLE A 0.00001;
>>>> Asamp2 = SAMPLE A 0.0001;
>>>>
>>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
>>>> records, respectively. However, what I find is Asamp and Asamp2 are closer
>>>> to 1000 and 10000 records, which seems like a 100-fold error in sample
>>>> size. Interestingly, in the limiting case of:
>>>>
>>>> Asamp3 = SAMPLE A 0.99;
>>>>
>>>> The actual subsampled size is VERY close to the expected 99% size of the
>>>> full sample size.
>>>>
>>>> Can anyone shed light as to what I may be doing wrong or
>>>> share their experiences if they have also seen issues with using SAMPLE in
>>>> PIG. Thank you.
>>>>
>>>> Brian
