Aaaah those extra operations could be it. I suspect you are affected by this bug:
https://issues.apache.org/jira/browse/PIG-2014

This was fixed in Pig 0.9.

-Dmitriy

On Tue, Sep 18, 2012 at 7:54 AM, Brian Choi <[email protected]> wrote:
> Dmitriy,
>
> Yes that is literally the script I ran, aside from the relation
> names. I did, however, run some operations upstream from those comments and
> I wonder if there is some indirect dependency on how SAMPLE is affected by
> upstream relations/filtering, etc. Thanks, I didn't expect anyone to solve
> or reproduce this, I was more wondering if anyone had seen this in their
> scripts before.
>
> Brian
>
>
> On Sun, Sep 16, 2012 at 10:24 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > I just ran this very script three times using Pig 0.8 (svn revision
> > 1148107) on a set of 2.5 million rows and got (2509), (2552), and
> > (2473) as the output.
> >
> > Don't know what to tell you.. can't reproduce. Are you sure you are
> > running on the input you think you are running on?
> >
> > Is this literally the script you ran?
> >
> > On Sun, Sep 16, 2012 at 10:13 PM, Brian Choi <[email protected]> wrote:
> > > The PIG script would be simply as follows:
> > >
> > > UIDs = FOREACH xRelation GENERATE $0 as user_id;
> > > UIDsample = SAMPLE UIDs 0.001;
> > > UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);
> > >
> > > where number of UIDs = ~ 2.5MM user ids
> > > and in this case UIDsampleCount = ~ 250,000 UIDs or records, but
> > > UIDsampleCount should be = ~ 2,500
> > >
> > > The version I am using is pig-0.8.1.
> > >
> > > Please let me know if there is any other information that you would
> > > like me to provide.
> > >
> > > brian
> > >
> > >
> > > On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > >
> > >> Brian, could you provide a complete script that reproduces the issue?
> > >> What version of pig are you on?
> > >>
> > >> Thanks,
> > >> -D
> > >>
> > >> On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi <[email protected]> wrote:
> > >> > Yes - I saw this issue with SAMPLE() in multiple runs. The strangest
> > >> > thing about this is that it approaches the correct values for SAMPLE()
> > >> > as you approach a sample size of 100% (or 0.99), but gets worse as you
> > >> > start getting to lower sample fractions.
> > >> >
> > >> > Brian
> > >> >
> > >> >
> > >> > On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[email protected]> wrote:
> > >> >
> > >> >> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> > >> >>
> > >> >>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> > >> >>> nothing but "Math.random() < x" where "x" is a double.
> > >> >>>
> > >> >> You are right. Sample operator translates into a filter operator with
> > >> >> condition "Math.random() < x".
> > >> >>
> > >> >>
> > >> >>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> > >> >>> records when running in local mode. I am curious if something can go
> > >> >>> wrong when running it in MR mode.
> > >> >>>
> > >> >>
> > >> >> I wouldn't expect different behavior in case of MR mode.
> > >> >>
> > >> >> Brian,
> > >> >> Do you see this behavior across multiple runs?
> > >> >>
> > >> >> -Thejas
> > >> >>
> > >> >>
> > >> >>
> > >> >>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
> > >> >>>
> > >> >>>> Hello Everyone,
> > >> >>>>
> > >> >>>> I am wondering if anyone has run into an issue that I am having
> > >> >>>> using SAMPLE in a pig script to create a subsample of 0.001% from the
> > >> >>>> original relation.
> > >> >>>>
> > >> >>>> Assume the relation "A" contains a single column of data (int type)
> > >> >>>> with 1,000,000 records
> > >> >>>>
> > >> >>>> Asamp = SAMPLE A 0.00001;
> > >> >>>> Asamp2 = SAMPLE A 0.0001;
> > >> >>>>
> > >> >>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> > >> >>>> records, respectively. However, what I find is Asamp and Asamp2 are
> > >> >>>> closer to 1000 and 10000 records, which seems like a 100-fold error in
> > >> >>>> sample size. Interestingly, in the limiting case of:
> > >> >>>>
> > >> >>>> Asamp3 = SAMPLE A 0.99;
> > >> >>>>
> > >> >>>> The actual subsampled size is VERY close to the expected 99% size of
> > >> >>>> the full sample size. Can anyone shed light as to what I may be doing
> > >> >>>> wrong or share their experiences if they have also seen issues with
> > >> >>>> using SAMPLE in PIG. Thank you.
> > >> >>>>
> > >> >>>> Brian
> > >> >>>>
> > >> >>>>
> > >> >>>
> > >> >>
> > >> >
> > >>
> >

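For anyone who wants to sanity-check the numbers in this thread: if SAMPLE really behaves like a per-record filter "Math.random() < x", as described above, the output count should follow a binomial distribution with mean n*x and standard deviation sqrt(n*x*(1-x)). The sketch below is a plain Python stand-in for that filter, not Pig code and not Pig's implementation; the function names and the seed are made up for illustration.

```python
import math
import random


def sample_count(n_records, fraction, rng):
    """Count how many of n_records pass the filter `rng.random() < fraction`."""
    return sum(1 for _ in range(n_records) if rng.random() < fraction)


def expected_and_stddev(n_records, fraction):
    """Mean and standard deviation of a Binomial(n_records, fraction) count."""
    mean = n_records * fraction
    stddev = math.sqrt(n_records * fraction * (1.0 - fraction))
    return mean, stddev


if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary fixed seed (an assumption), for repeatability
    n, p = 2_500_000, 0.001    # record count and sample fraction quoted in the thread
    mean, sd = expected_and_stddev(n, p)
    print(f"expected {mean:.0f} +/- {sd:.1f}")
    for _ in range(3):
        print(sample_count(n, p, rng))
```

With 2.5 million records and a fraction of 0.001, the mean is 2,500 and the standard deviation is about 50. The three counts Dmitriy reported (2509, 2552, 2473) all sit within roughly one standard deviation of that mean, while a count near 250,000 is far outside anything random variation alone would produce.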