The Pig script is simply as follows:

UIDs = FOREACH xRelation GENERATE $0 as user_id;
UIDsample = SAMPLE UIDs 0.001;
UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);

where UIDs contains ~2.5MM user ids. In this case UIDsampleCount comes out
at ~250,000 records, but at a sample fraction of 0.001 it should be ~2,500.
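
As a sanity check on the expected figure, here is a minimal Python sketch (not
part of the Pig job). It assumes SAMPLE behaves like an independent
"random() < fraction" filter per record, as described further down the thread.
Python's random.random() stands in for Java's Math.random(), and sample_count
is a hypothetical helper name used only for this illustration.

```python
import random

def sample_count(n_records, fraction, seed=42):
    """Count records kept by an independent filter: random() < fraction."""
    rng = random.Random(seed)  # seeded only so the illustration is repeatable
    return sum(1 for _ in range(n_records) if rng.random() < fraction)

n = 2_500_000   # approximate number of user ids
p = 0.001       # sample fraction passed to SAMPLE

kept = sample_count(n, p)
expected = n * p                      # about 2,500
std_dev = (n * p * (1 - p)) ** 0.5    # about 50 (binomial standard deviation)

print(kept, expected, round(std_dev))
```

Under that assumption the count should land within a few hundred of 2,500, so
a result near 250,000 is far too large to be random fluctuation.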

The version I am using is pig-0.8.1.

Please let me know if there is any other information that you would like me
to provide.

        brian


On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Brian, could you provide a complete script that reproduces the issue?
> What version of pig are you on?
>
> Thanks,
> -D
>
> On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi <[email protected]> wrote:
> > Yes - I saw this issue with SAMPLE() in multiple runs. The strangest thing
> > about this is that it approaches the correct values for SAMPLE() as you
> > approach a sample size of 100% (or 0.99), but gets worse as you move
> > to lower sample fractions.
> >
> >        Brian
> >
> >
> > On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[email protected]> wrote:
> >
> >> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> >>
> >>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> >>> nothing but "Math.random() < x" where "x" is a double.
> >>>
> >> You are right. The SAMPLE operator translates into a filter operator with
> >> condition "Math.random() < x".
> >>
> >>
> >>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> >>> records when running in local mode. I am curious if something can go wrong
> >>> when running it in MR mode.
> >>>
> >>
> >> I wouldn't expect different behavior in case of MR mode.
> >>
> >> Brian,
> >> Do you see this behavior across multiple runs ?
> >>
> >> -Thejas
> >>
> >>
> >>
> >>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
> >>>
> >>>> Hello Everyone,
> >>>>
> >>>> I am wondering if anyone has run into an issue that I am having
> >>>> using SAMPLE in a pig script to create a subsample of 0.001% from the
> >>>> original relation.
> >>>>
> >>>> Assume the relation "A" contains a single column of data (int type) with
> >>>> 1,000,000 records
> >>>>
> >>>> Asamp = SAMPLE A 0.00001;
> >>>> Asamp2 = SAMPLE A 0.0001;
> >>>>
> >>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> >>>> records, respectively. However, what I find is that Asamp and Asamp2 are
> >>>> closer to 1000 and 10000 records, which seems like a 100-fold error in
> >>>> sample size. Interestingly, in the limiting case of:
> >>>>
> >>>> Asamp3 = SAMPLE A 0.99;
> >>>>
> >>>> The actual subsampled size is VERY close to the expected 99% of the
> >>>> full sample size. Can anyone shed light on what I may be doing wrong, or
> >>>> share their experiences if they have also seen issues with using SAMPLE
> >>>> in PIG? Thank you.
> >>>>
> >>>>             Brian
> >>>>
> >>>>
> >>>
> >>
>
