Re: Issues with SAMPLE in PIG v0.8.1

Brian Choi Tue, 18 Sep 2012 21:15:55 -0700

Dmitriy,

       Thank you for the continued effort and for providing this
information. I think it does shed some light on what I was suspecting.
Perhaps we should upgrade to a later PIG version to avoid this issue.
Thanks again.


         Brian


On Tue, Sep 18, 2012 at 6:09 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Aaaah those extra operations could be it. I suspect you are affected by
> this bug:
>
> https://issues.apache.org/jira/browse/PIG-2014
>
> This was fixed in Pig 0.9.
>
> -Dmitriy.
>
> On Tue, Sep 18, 2012 at 7:54 AM, Brian Choi <[email protected]> wrote:
>
> > Dmitriy,
> >
> >        Yes, that is literally the script I ran, aside from the relation
> > names. I did, however, run some operations upstream from those comments,
> > and I wonder if there is some indirect dependency in how SAMPLE is
> > affected by upstream relations, filtering, etc. Thanks. I didn't expect
> > anyone to solve or reproduce this; I was more wondering if anyone had
> > seen this in their scripts before.
> >
> >            Brian
> >
> >
> > On Sun, Sep 16, 2012 at 10:24 PM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> >
> > > I just ran this very script three times using Pig 0.8 (svn revision
> > > 1148107) on a set of 2.5 million rows and got (2509), (2552), and
> > > (2473) as the output.
> > >
> > > Don't know what to tell you.. can't reproduce. Are you sure you are
> > > running on the input you think you are running on?
> > >
> > > Is this literally the script you ran?
> > >
> > > > On Sun, Sep 16, 2012 at 10:13 PM, Brian Choi <[email protected]> wrote:
> > > > The PIG script would be simply as follows:
> > > >
> > > > UIDs = FOREACH xRelation GENERATE $0 as user_id;
> > > > UIDsample = SAMPLE UIDs 0.001;
> > > > UIDsampleCount = FOREACH (GROUP UIDsample ALL) GENERATE COUNT($1);
> > > >
> > > > > where the number of UIDs is ~2.5MM user ids. In this case
> > > > > UIDsampleCount = ~250,000 records, but UIDsampleCount should be
> > > > > ~2,500.
> > > >
> > > > The version I am using is pig-0.8.1.
> > > >
> > > > > Please let me know if there is any other information that you would
> > > > > like me to provide.
> > > >
> > > >         brian
> > > >
> > > >
> > > > > On Sun, Sep 16, 2012 at 10:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > >
> > > > >> Brian, could you provide a complete script that reproduces the issue?
> > > >> What version of pig are you on?
> > > >>
> > > >> Thanks,
> > > >> -D
> > > >>
> > > > >> On Sun, Sep 16, 2012 at 8:15 PM, Brian Choi <[email protected]> wrote:
> > > > >> > Yes, I saw this issue with SAMPLE in multiple runs. The strangest
> > > > >> > thing about this is that the output approaches the correct values
> > > > >> > as you approach a sample fraction of 100% (or 0.99), but gets worse
> > > > >> > as you move to lower sample fractions.
> > > >> >
> > > >> >        Brian
> > > >> >
> > > >> >
> > > > >> > On Thu, Sep 13, 2012 at 6:15 PM, Thejas Nair <[email protected]> wrote:
> > > >> >
> > > >> >> On 9/12/12 11:12 PM, Cheolsoo Park wrote:
> > > >> >>
> > > > >> >>> I am puzzled about this. If I am not mistaken, the SAMPLE operator
> > > > >> >>> is nothing but "Math.random() < x" where "x" is a double.
> > > > >> >>>
> > > > >> >> You are right. The SAMPLE operator translates into a filter operator
> > > > >> >> with the condition "Math.random() < x".
> > > > >> >>
> > > > >> >>
> > > > >> >>> In my test, SAMPLE A 0.00001 returns about 10 records with a million
> > > > >> >>> records when running in local mode. I am curious if something can go
> > > > >> >>> wrong when running it in MR mode.
> > > >> >>>
> > > >> >>
> > > >> >> I wouldn't expect different behavior in case of MR mode.
> > > >> >>
> > > >> >> Brian,
> > > >> >> Do you see this behavior across multiple runs ?
> > > >> >>
> > > >> >> -Thejas
> > > >> >>
> > > >> >>
> > > >> >>
> > > > >> >>> On Tue, Sep 11, 2012 at 3:37 PM, Brian Choi <[email protected]> wrote:
> > > >> >>>
> > > > >> >>>> Hello Everyone,
> > > > >> >>>>
> > > > >> >>>>            I am wondering if anyone has run into an issue that I am
> > > > >> >>>> having using SAMPLE in a pig script to create a subsample of 0.001%
> > > > >> >>>> from the original relation.
> > > >> >>>>
> > > > >> >>>> Assume the relation "A" contains a single column of data (int type)
> > > > >> >>>> with 1,000,000 records
> > > >> >>>>
> > > >> >>>> Asamp = SAMPLE A 0.00001;
> > > >> >>>> Asamp2 = SAMPLE A 0.0001;
> > > >> >>>>
> > > > >> >>>> Asamp and Asamp2 should produce subsampled relations with 10 and 100
> > > > >> >>>> records, respectively. However, what I find is that Asamp and Asamp2
> > > > >> >>>> are closer to 1000 and 10000 records, which seems like a 100-fold
> > > > >> >>>> error in sample size. Interestingly, in the limiting case of:
> > > >> >>>>
> > > >> >>>> Asamp3 = SAMPLE A 0.99;
> > > >> >>>>
> > > > >> >>>> The actual subsampled size is VERY close to the expected 99% of the
> > > > >> >>>> full sample size. Can anyone shed light on what I may be doing wrong,
> > > > >> >>>> or share their experiences if they have also seen issues with using
> > > > >> >>>> SAMPLE in PIG? Thank you.
> > > >> >>>>
> > > >> >>>>             Brian
> > > >> >>>>
> > > >> >>>>
> > > >> >>>
> > > >> >>
> > > >>
> > >
> >
>
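
Note on the expected counts discussed in this thread: if SAMPLE really
behaves like a per-row filter "Math.random() < x", as described above, the
number of rows kept follows a binomial distribution with mean n*x and
standard deviation sqrt(n*x*(1-x)). The Python sketch below is only an
illustration of that model, not Pig's implementation; the function names
and the seed are made up for this example, and it does not model the
PIG-2014 bug. For n = 2.5 million and x = 0.001 it gives a mean of 2,500
with a standard deviation of about 50.

```python
import math
import random


def expected_count(n_rows, fraction):
    """Mean and standard deviation of the kept-row count (binomial model)."""
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1.0 - fraction))
    return mean, std


def sample_count(n_rows, fraction, rng):
    """Count rows kept by an independent 'random() < fraction' test per row.

    Hypothetical stand-in for the filter the thread describes; not Pig code.
    """
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary seed, chosen only for repeatability
    n_rows, fraction = 2_500_000, 0.001
    mean, std = expected_count(n_rows, fraction)
    kept = sample_count(n_rows, fraction, rng)
    print(f"expected {mean:.0f} +/- {std:.1f}, simulated {kept}")
```

Under this model, the three counts reported above for Pig 0.8 (2509, 2552,
2473) all fall within a little over one standard deviation of 2,500, while
a count of ~250,000 is thousands of standard deviations away, so it cannot
be explained by random variation alone.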
