Dmitriy,
Yes, that is literally the script I ran, aside from the relation
names. I did, however, run some operations upstream of those statements, and
I wonder whether SAMPLE is indirectly affected by upstream relations,
filtering, etc. Thanks. I didn't expect anyone to solve or reproduce this;
I was mostly wondering whether anyone had seen it in their scripts before.
Brian
On Sun, Sep 16, 2012 at 10:24 PM, Dmitriy Ryaboy <[email protected]> wrote:
> I just ran this very script three times using Pig 0.8 (svn revision
> 1148107) on a set of 2.5 million rows and got (2509), (2552), and
> (2473) as the output.
>
> Don't know what to tell you.. can't reproduce. Are you sure you are
> running on the input you think you are running on?
>
> Is this literally the script you ran?
>
> On Sun, Sep 16, 2012 at 10:13 PM, Brian Choi <[email protected]> wrote:
> > The Pig script is simply as follows:
> >
> > UIDs = FOREACH xRelation GENERATE $0 as user_id;
> > UIDsample = SAMPLE UIDs 0.001;
> > UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);
> >
> > where the number of UIDs is ~2.5MM user IDs.
> > In this case UIDsampleCount comes out at ~250,000 records, but
> > it should be ~2,500.
> >
> > The version I am using is pig-0.8.1.
> >
> > Please let me know if there is any other information that you would like me
> > to provide.
> >
> > brian
> >
> >
> > On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> Brian, could you provide a complete script that reproduces the issue?
> >> What version of pig are you on?
> >>
> >> Thanks,
> >> -D
> >>
> >> On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi <[email protected]> wrote:
> >> > Yes, I saw this issue with SAMPLE in multiple runs. The strangest thing
> >> > about this is that the results approach the correct values as you
> >> > approach a sample fraction of 100% (or 0.99), but get worse as you move
> >> > to lower sample fractions.
> >> >
> >> > Brian
> >> >
> >> >
> >> > On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[email protected]> wrote:
> >> >
> >> >> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> >> >>
> >> >>> I am puzzled about this. If I am not mistaken, the SAMPLE operator is
> >> >>> nothing but "Math.random() < x" where "x" is a double.
> >> >>>
> >> >> You are right. The SAMPLE operator translates into a filter operator with
> >> >> the condition "Math.random() < x".
> >> >>
> >> >>
> >> >>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> >> >>> records when running in local mode. I am curious if something can go wrong
> >> >>> when running it in MR mode.
> >> >>>
> >> >>
> >> >> I wouldn't expect different behavior in case of MR mode.
> >> >>
> >> >> Brian,
> >> >> Do you see this behavior across multiple runs ?
> >> >>
> >> >> -Thejas
> >> >>
> >> >>
> >> >>
> >> >>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
> >> >>>
> >> >>> Hello Everyone,
> >> >>>>
> >> >>>> I am wondering if anyone has run into an issue that I am having
> >> >>>> using SAMPLE in a Pig script to create a subsample of 0.001% from the
> >> >>>> original relation.
> >> >>>>
> >> >>>> Assume the relation "A" contains a single column of data (int type) with
> >> >>>> 1,000,000 records
> >> >>>>
> >> >>>> Asamp = SAMPLE A 0.00001;
> >> >>>> Asamp2 = SAMPLE A 0.0001;
> >> >>>>
> >> >>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> >> >>>> records, respectively. However, what I find is Asamp and Asamp2 are closer
> >> >>>> to 1000 and 10000 records, which seems like a 100-fold error in sample
> >> >>>> size. Interestingly, in the limiting case of:
> >> >>>>
> >> >>>> Asamp3 = SAMPLE A 0.99;
> >> >>>>
> >> >>>> The actual subsampled size is VERY close to the expected 99% of the
> >> >>>> full relation size. Can anyone shed light on what I may be doing wrong,
> >> >>>> or share their experiences if they have also seen issues with using SAMPLE
> >> >>>> in Pig? Thank you.
> >> >>>>
> >> >>>> Brian
> >> >>>>
> >> >>>>
> >> >>>
> >> >>
> >>
>
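
For reference, the filter semantics described in the thread ("Math.random() < x" applied to each record) can be imitated outside Pig to see how large a sample to expect and how much it should vary between runs. The sketch below is a plain Python simulation, not Pig's implementation; the function names sample and expected_count, the fixed seed, and the use of Python's random module in place of Java's Math.random() are all assumptions made for illustration. Under that model the sample size is binomial, so for 2.5 million rows at a fraction of 0.001 the mean is 2,500 with a standard deviation of about 50.

```python
import math
import random


def sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`.

    Mirrors the "random() < x" filter described in the thread; this is
    an illustrative stand-in, not Pig's actual code.
    """
    return [row for row in rows if rng.random() < fraction]


def expected_count(n_rows, fraction):
    """Mean and standard deviation of the sample size (binomial model)."""
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1.0 - fraction))
    return mean, std


# Figures taken from the thread: ~2.5 million rows, SAMPLE fraction 0.001.
n_rows = 2_500_000
fraction = 0.001

rng = random.Random(42)  # fixed seed is an assumption, for repeatable runs
kept = sample(range(n_rows), fraction, rng)
mean, std = expected_count(n_rows, fraction)

# How many standard deviations the reported ~250,000 is from the mean.
reported = 250_000
sigmas_off = (reported - mean) / std

print("simulated sample size:", len(kept))
print("expected mean and std:", mean, round(std, 1))
print("reported count is", round(sigmas_off), "std devs above the mean")
```

Under this model, the three counts Dmitriy reported (2509, 2552, 2473) all fall within about one standard deviation of 2,500, while a count near 250,000 is thousands of standard deviations away. That cannot be explained by random variation in the filter itself, which suggests looking at the input size or at what happens upstream of SAMPLE.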